为什么pandas.groupby.mean所以比并行实现更快

Question

我用的是熊猫GROUPBY的意思是像一个非常大的数据集以下功能：

import pandas as pd
df=pd.read_csv("large_dataset.csv")
df.groupby(['variable']).mean()

它看起来像功能没有使用多线程处理，因此，我实现了一个并联的版本：

import pandas as pd 
from multiprocessing import Pool, cpu_count 

def meanFunc(tmp_name, df_input): 
    df_res=df_input.mean().to_frame().transpose()
    return df_res 

def applyParallel(dfGrouped, func):
    num_process=int(cpu_count())
    with Pool(num_process) as p: 
        ret_list=p.starmap(func, [[name, group] for name, group in dfGrouped])
    return pd.concat(ret_list)

applyParallel(df.groupby(['variable']), meanFunc)

然而，似乎大熊猫实施仍比我的并行实现方式更快。

我在看的source code大熊猫GROUPBY，我看到它使用用Cython。那是什么原因呢？

def _cython_agg_general(self, how, alt=None, numeric_only=True,
                        min_count=-1):
    output = {}
    for name, obj in self._iterate_slices():
        is_numeric = is_numeric_dtype(obj.dtype)
        if numeric_only and not is_numeric:
            continue

        try:
            result, names = self.grouper.aggregate(obj.values, how,
                                                   min_count=min_count)
        except AssertionError as e:
            raise GroupByError(str(e))
        output[name] = self._try_cast(result, obj)

    if len(output) == 0:
        raise DataError('No numeric types to aggregate')

    return self._wrap_aggregated_output(output, names)

Answer 1

简短的回答 - 如果你想对这些案件类型并行使用dask。你有你的方式，它避免陷阱。它仍然可能不会更快，但会给你最好的射手，是大熊猫在很大程度上直接替代。

更长的答案

1）平行固有增加了开销，所以最好你并联操作是比较昂贵的。加起来的数字是不是特别 - 你说得对，用Cython用在这里，你看代码是分派逻辑。实际的核心用Cython是here，转化到一个非常简单的C环。

2）您使用多处理 - 这意味着每个过程中需要采取的数据的副本。这是昂贵的。通常情况下，你必须这样做，因为在GIL的蟒蛇 - 实际上，你可以（和DASK一样）这里使用线程，因为大熊猫操作C和释放GIL。

3）@AKX在评论中指出的那样 - 你并行（... name, group in dfGrouped之前迭代）也是比较昂贵的 - 每个组的建设新的子数据帧。原来熊猫算法遍历到位的数据。

为什么pandas.groupby.mean所以比并行实现更快

问题描述投票：2回答：1

1个回答

最新问题

为什么pandas.groupby.mean所以比并行实现更快

问题描述 投票：2回答：1

1个回答

最新问题

问题描述投票：2回答：1