Why is pandas faster than pure C in a groupby operation?

Question

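I have an nparray of x,y pairs with shape `(n,2)`, and I know for certain that each x has multiple y values. I want to compute the mean of y for every unique x. It occurred to me that this calls for a `groupby` operation followed by a `mean`, both of which are available in the pandas library. However, assuming pandas would be too slow for my data size (over a million points), I wrote a simple program in C and called the C function through ctypes. I compiled it into a shared object file with GCC (MinGW) using `-fPIC`.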

int average( int* array , int size_array , int* unique , int size_unique , float* avg ){

    if (size_array % 2 != 0){
        return 1;
    }

    for (int i = 0 ; i < size_unique ; i++){

        int curX = unique[i];
        int sum = 0;
        int count = 0;

        for (int j = 0 ; j < size_array ; j += 2){
            if ( array[j] == curX ){
                sum += (array[j+1]);
                count += 1;
            }
        }

        float average = ((float)sum / (float)count);

        avg[i] = average;

    }

    return 0;

}

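Later, because the program was still slow (it took about 1.5 seconds), I gave pandas a try and was surprised by how fast it was: almost twice as fast as the program I wrote in C. This makes no sense to me. How does it reach that level of performance? Does pandas use a hash table?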

import numpy as np
import pandas as pd

ar = np.random.randint(0,2000,size = (40000,2))
df = pd.DataFrame({'x': ar[:,0], 'y': ar[:,1]})
df = df.groupby('y', as_index=False)['x'].mean()
x = df[['x']].to_numpy()
y = df[['y']].to_numpy()

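I calculated that for an array of size `(40000,2)` with about 2,000 unique elements, my C code performs roughly `80,000,000` operations in less than `0.2s`. That is about 2.5 nanoseconds per operation, which is close to the limit of my processor (a 3.5 GHz quad-core CPU, Intel i7 4720HQ). So my C code is already pushing the CPU. How can pandas push it even further?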

Answer 1

Your C code is not efficient. It is O(N * M), where N is `size_array` and M is `size_unique`: for every unique x, it scans the entire data array again.

Here is a way to do better:

1. Sort unique  (make a copy if `unique` is not allowed to change)

2. Make an array TMP with size_unique elements of a struct with two ints, i.e. sum and count

3. Iterate your data array

   a. Find array[i] in sorted `unique` using binary search and save the index

   b. Use the index to update TMP[index]


4. Iterate TMP and calculate average

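Sorting costs O(M * log M) and each of the N lookups costs O(log M). Since N is much larger than M, the worst case can be approximated as O(N * log M), which is much better than O(N * M).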
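The steps above can be sketched as follows. This is only a minimal illustration in Python rather than C, chosen to keep it short. The function name `group_means_sorted` and the choice to skip x values that are missing from `unique` are assumptions of this sketch, not part of the original code.

```python
from bisect import bisect_left


def group_means_sorted(pairs, unique):
    """Mean of y per unique x, using sorted keys plus binary search."""
    keys = sorted(unique)        # step 1: sort a copy of `unique`
    sums = [0] * len(keys)       # step 2: TMP, kept as two parallel lists
    counts = [0] * len(keys)

    for x, y in pairs:           # step 3: single pass over the data
        idx = bisect_left(keys, x)            # 3a: binary search, O(log M)
        if idx == len(keys) or keys[idx] != x:
            continue             # assumption: ignore x values not in `unique`
        sums[idx] += y           # 3b: update TMP[index]
        counts[idx] += 1

    # step 4: compute the averages, skipping keys that never occurred
    return {k: s / c for k, s, c in zip(keys, sums, counts) if c > 0}


pairs = [(3, 10), (1, 4), (3, 20), (1, 6), (2, 7)]
print(group_means_sorted(pairs, [3, 1, 2]))  # {1: 5.0, 2: 7.0, 3: 15.0}
```

The same structure carries over to C with `qsort`, `bsearch` (or a hand-written binary search) and an array of sum/count structs. Skipping keys with a zero count also avoids the division by zero that the original loop would hit for an x that never appears.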

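There may be even more efficient ways to do this. My point is that pandas is probably heavily optimized while your C code is not, so you can't really compare the two.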
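One candidate for such a more efficient approach is the hash-table idea raised in the question: accumulate a running sum and count per key in a single pass, which is O(N) on average and needs no precomputed list of unique values. The sketch below uses a plain Python dictionary to show the idea. The function name `group_means_hashed` is hypothetical, and the code is not meant to represent how pandas implements `groupby` internally.

```python
from collections import defaultdict


def group_means_hashed(pairs):
    """Mean of y per x in one pass, using a hash table of [sum, count]."""
    totals = defaultdict(lambda: [0, 0])   # x -> [running sum, running count]

    for x, y in pairs:
        acc = totals[x]      # average O(1) lookup; creates the entry if new
        acc[0] += y
        acc[1] += 1

    # every key present has been seen at least once, so count is never zero
    return {x: s / c for x, (s, c) in totals.items()}


pairs = [(3, 10), (1, 4), (3, 20), (1, 6), (2, 7)]
print(group_means_hashed(pairs))  # {3: 15.0, 1: 5.0, 2: 7.0}
```

When the keys are small non-negative integers, as in the `randint(0,2000)` example, a plain array indexed by x can replace the hash table entirely.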
