Reading multiple csv.gz files into a dask dataframe

Question (votes: 0, answers: 1)

I have multiple .csv.gz files that I am trying to read into a dask dataframe. I was able to do this with the following code:

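import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

file_paths = glob.glob(file_pattern)
# Note: this delayed helper is defined but never called below
@delayed
def read_csv(file_paths):
    return dd.read_csv(file_paths, compression='gzip', blocksize=None, dtype=None)

dfs = [delayed(pd.read_csv)(fn) for fn in file_paths]
df = dd.from_delayed(dfs)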



The problem is that when I try to convert the dask dataframe into a pandas dataframe using
`df = df.compute()`

I get the error message:
"EmptyDataError: No columns to parse from file"
I would really appreciate any help with this.
python pandas dask dask-dataframe dask-delayed
1 answer
0 votes

The following worked for me:

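import os
import pandas as pd
import dask.dataframe as dd
file_path = r"C:\Users\John Doe\Downloads\checking gz"

dfs = []
for file in os.listdir(file_path):
    if file.endswith('.gz'):
        # on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False,
        # which was removed in pandas 2.0
        df = dd.read_csv(os.path.join(file_path, file), compression='gzip',
                         blocksize=None, on_bad_lines='skip')
        dfs.append(df)
        print(df)

# Concatenate the per-file dask dataframes, then materialize as pandas
new_df = dd.concat(dfs)
pd_df = new_df.compute()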
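
One possible cause of "EmptyDataError: No columns to parse from file" is that one of the .gz files is empty (zero bytes, or decompressing to nothing), since pandas raises that error when it finds no data to parse. Whether that is what happened in the question is an assumption. A minimal sketch that filters such files out before reading is below; the helper names `is_empty_gz`, `non_empty_gz_files` and `load_non_empty`, and the `*.csv.gz` pattern, are hypothetical and not part of dask or pandas.

```python
import glob
import gzip
import os


def is_empty_gz(path):
    """Return True if the gzip file decompresses to nothing but whitespace."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if line.strip():
                return False
    return True


def non_empty_gz_files(directory, pattern="*.csv.gz"):
    """List matching files in sorted order, skipping ones with no content."""
    paths = sorted(glob.glob(os.path.join(directory, pattern)))
    return [p for p in paths if not is_empty_gz(p)]


def load_non_empty(directory):
    """Read the remaining files into a single dask dataframe."""
    # Imported here so the helpers above work without dask installed.
    import dask.dataframe as dd

    paths = non_empty_gz_files(directory)
    # gzip is not splittable, so blocksize=None yields one partition per file.
    return dd.read_csv(paths, compression="gzip", blocksize=None)
```

Calling `load_non_empty(file_path).compute()` would then return a pandas dataframe built only from the files that contain data. If the error persists, the cause is likely something other than empty files.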