Reading multiple csv.gz files into a Dask dataframe

Question

I have multiple .csv.gz files that I am trying to read into a Dask dataframe. I was able to do this with the following code:

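import glob
import pandas as pd
import dask.dataframe as dd
from dask import delayed

file_paths = glob.glob(file_pattern)

# Note: this helper is defined but never called below.
@delayed
def read_csv(file_paths):
    return dd.read_csv(file_paths, compression='gzip', blocksize=None, dtype=None)

dfs = [delayed(pd.read_csv)(fn) for fn in file_paths]
df = dd.from_delayed(dfs)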



The problem is that when I try to convert the Dask dataframe into a pandas dataframe using
`df = df.compute()`

I get the error message:
"EmptyDataError: No columns to parse from file"
I would really appreciate any help with this.

Answer 1

The following worked for me:

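import os
import pandas as pd
import dask.dataframe as dd

file_path = r"C:\Users\John Doe\Downloads\checking gz"

dfs = []
for file in os.listdir(file_path):
    if file.endswith('.gz'):
        # blocksize=None: gzip cannot be split, so each file becomes one partition.
        # on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0.
        df = dd.read_csv(os.path.join(file_path, file), compression='gzip',
                         blocksize=None, on_bad_lines='skip')
        dfs.append(df)
        print(df)

new_df = dd.concat(dfs)
pd_df = new_df.compute()  # materialize as a single pandas DataFrame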
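
pandas raises "EmptyDataError: No columns to parse from file" when a file has no content at all, so one possible cause of the error in the question is an empty file among those matched by the pattern. Below is a minimal sketch under that assumption: it filters out files whose decompressed content is empty before handing the rest to Dask. The helper names (`gz_has_data`, `non_empty_gz_files`, `read_gz_csvs`) are made up for illustration, and the check does not catch files that contain only blank lines.

```python
import glob
import gzip


def gz_has_data(path):
    """Return True if the gzip file decompresses to at least one byte."""
    with gzip.open(path, "rb") as fh:
        return bool(fh.read(1))


def non_empty_gz_files(pattern):
    """Return the sorted paths matching `pattern` that are not empty."""
    return [p for p in sorted(glob.glob(pattern)) if gz_has_data(p)]


def read_gz_csvs(pattern):
    """Build one Dask dataframe from the non-empty files (requires dask)."""
    # Imported here so the two helpers above can be used without dask installed.
    import dask.dataframe as dd

    paths = non_empty_gz_files(pattern)
    # gzip cannot be split, so blocksize=None gives one partition per file.
    return dd.read_csv(paths, compression="gzip", blocksize=None)
```

With this in place, something like `read_gz_csvs(file_pattern).compute()` would skip the empty files instead of failing on them. Printing the paths that were filtered out may also help confirm whether an empty file really was the cause.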
