Fastest way to read Parquet from S3

Question

I have a Parquet file in AWS S3 that I would like to read into a Pandas DataFrame. There are two ways I can do this.

1)
import pyarrow.parquet as pq
table = pq.read_table("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")  # takes 1 sec
pandas_table = table.to_pandas()  # takes 1 sec !!!
2)
import pandas as pd
table = pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")  # takes 2 sec

I suspect option 2 is really just doing option 1 under the hood.

What is the fastest way to read a Parquet file into Pandas?

Answer 1

You are correct. Option 2 is just option 1 under the hood.

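What is the fastest way to read a Parquet file into Pandas?

Both option 1 and option 2 are probably good enough. However, if you are trying to shave off every last bit of time, you may need to go one layer deeper, depending on your pyarrow version. It turns out that option 1 is itself just a proxy, in this case for the datasets API: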

import pyarrow.dataset as ds
dataset = ds.dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
table = dataset.to_table(use_threads=True)
df = table.to_pandas()

For pyarrow versions >= 4 and < 7, you can usually get slightly better performance on S3 using the asynchronous scanner:

import pyarrow.dataset as ds
dataset = ds.dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
table = dataset.to_table(use_threads=True, use_async=True)
df = table.to_pandas()
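If the same code has to run against several pyarrow versions, the version check can be folded into a small helper. This is a minimal sketch, not part of pyarrow: to_table_kwargs and read_parquet_dataset are hypothetical names, and the version ranges are simply the ones stated above.

```python
def to_table_kwargs(pyarrow_version: str) -> dict:
    """Hypothetical helper: keyword arguments for Dataset.to_table()."""
    major = int(pyarrow_version.split(".")[0])
    kwargs = {"use_threads": True}
    # Per the answer, the async scanner is opt-in for pyarrow >= 4 and < 7.
    if 4 <= major < 7:
        kwargs["use_async"] = True
    return kwargs


def read_parquet_dataset(path: str):
    """Hypothetical helper: read a Parquet path or URI into a DataFrame."""
    # Imported here so the version helper above stays usable on its own.
    import pyarrow as pa
    import pyarrow.dataset as ds

    dataset = ds.dataset(path)
    table = dataset.to_table(**to_table_kwargs(pa.__version__))
    return table.to_pandas()
```

For example, read_parquet_dataset("s3://tpc-h-parquet/lineitem/part0.snappy.parquet") would request the asynchronous scanner only when the installed pyarrow falls in that range.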

In pyarrow version 7, the asynchronous scanner is the default, so you can once again simply use

pd.read_parquet("s3://tpc-h-parquet/lineitem/part0.snappy.parquet")
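Because the differences between these approaches are small and depend on the pyarrow version, it may be worth timing them on your own file. The sketch below is a generic timing harness using only the standard library; best_time and compare are hypothetical names, and taking the minimum over a few repeats is just one reasonable way to reduce network noise.

```python
import time
from typing import Callable, Dict


def best_time(fn: Callable[[], object], repeats: int = 3) -> float:
    """Run fn several times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(readers: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Time each named reader and return a name -> seconds mapping."""
    return {name: best_time(fn, repeats) for name, fn in readers.items()}
```

To use it, pass each approach as a zero-argument callable, for instance compare({"pandas": lambda: pd.read_parquet(uri), "pyarrow": lambda: pq.read_table(uri).to_pandas()}), where uri is your S3 path.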
