Converting HDF5 to Parquet without loading it into memory

Votes: 0 · Answers: 2

I have a large dataset (~600 GB) stored in HDF5 format. Since it is too large to fit in memory, I would like to convert it to Parquet and use pySpark to perform some basic preprocessing (normalization, computing correlation matrices, etc.). However, I am not sure how to convert the entire dataset to Parquet without loading it into memory.

I looked at this gist: https://gist.github.com/jiffyclub/905bf5e8bf17ec59ab8f#file-hdf_to_parquet-py, but it appears to read the entire dataset into memory.

One approach I thought of is to read the HDF5 file in chunks and save it incrementally to a Parquet file:

test_store = pd.HDFStore('/path/to/myHDFfile.h5')
nrows = test_store.get_storer('df').nrows
chunksize = N
for i in range(nrows//chunksize + 1):
    # convert_to_Parquet() ...

However, I cannot find any documentation on building a Parquet file incrementally. Any links for further reading?

python pandas hdf5 parquet hdf
2 Answers
19 votes

You can use pyarrow for this!

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    parquet_writer.close()
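One caveat with this approach: `pd.read_hdf` only returns a chunk iterator when the HDF5 store was written in PyTables "table" format; for a "fixed"-format store it raises `TypeError: can only use an iterator or chunksize on a table`. A quick self-contained check (the key name `df` and the file name are assumptions for illustration):

```python
import pandas as pd

# Write a small demo store in "table" format (requires PyTables).
pd.DataFrame({"a": range(5)}).to_hdf("check.h5", key="df", format="table")

# Inspect the store before attempting a chunked conversion.
with pd.HDFStore("check.h5", mode="r") as store:
    storer = store.get_storer("df")
    print(storer.is_table)  # True for table format; False means no chunking
    print(storer.nrows)     # row count, read without loading the data
```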

-1 votes

Thanks for the answer. I tried calling the py script below from the CLI, but it neither shows any error nor produces the converted Parquet file.

The h5 file is not empty either.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Use raw strings for Windows paths so sequences like "\t" are not
# interpreted as escape characters.
h5_file = r"C:\Users\...\tall.h5"
parquet_file = r"C:\Users\...\my.parquet"

def convert_hdf5_to_parquet(h5_file, parquet_file, chunksize=100000):

    stream = pd.read_hdf(h5_file, chunksize=chunksize)

    for i, chunk in enumerate(stream):
        print("Chunk {}".format(i))
        print(chunk.head())

        if i == 0:
            # Infer schema and open parquet file on first chunk
            parquet_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema, compression='snappy')

        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)
    parquet_writer.close()

# As posted, the script only defined the function and never called it,
# which is why it ran silently and produced no Parquet file:
convert_hdf5_to_parquet(h5_file, parquet_file)