Reading multiple CSVs in parallel into a list of separate dataframes with Python is not working

Question

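I have a situation where I need to read multiple CSVs from S3 and store each one as a separate dataframe in a list of dataframes. When I read the CSVs one at a time, it works. To speed things up, I tried to read them in parallel by recreating the parallel process from this answer. However, when I do so, the process hangs. What might be the problem? Is there something in dask that prevents this from working?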

# Load libraries
import pandas as pd
import dask.dataframe as dd
from multiprocessing import Pool

# Define function    
def read_csv(table):
    path = 's3://my-bucket/{}/*.csv'.format(table)
    df = dd.read_csv(path, assume_missing=True).compute()
    return df

# Define tables
tables = ['sales', 'customers', 'inventory']

# Run function to read one-by-one (this works)
df_list = []
for t in tables:
    df_list.append(read_csv(t))

# Try to run function in parallel (this hangs, never completes)
with Pool(processes=3) as pool:
    df_list = pool.map(read_csv, tables)

Answer 1

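It is odd that you are trying to nest Dask inside another parallel solution. This is likely to result in suboptimal performance. Instead, if you want to use processes, I recommend changing Dask's default scheduler to multiprocessing and then using dd.read_csv as normal.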

import dask

dfs = [dd.read_csv(...) for table in tables]
dfs = dask.compute(*dfs, scheduler="processes")  # tuple with one pandas DataFrame per table

For more information on Dask schedulers, see https://docs.dask.org/en/latest/scheduling.html
