为什么 numba popcount 代码比同等 C 代码快两倍?

问题描述 投票:0回答:1

我有这个简单的 python/numba 代码:

from numba import njit
import numba as nb
@nb.njit(nb.uint64(nb.uint64))
def popcount(x): 
      b=0
      while(x > 0):
          x &= x - nb.uint64(1)   
          b+=1
      return b
@njit
def timed_loop(n):
    summand = 0
    for i in range(n):
        summand += popcount(i)
    return summand

它只是将整数 0 到 n - 1 的 popcount 相加。

当我计时时,我得到:

%timeit timed_loop(1000000)
340 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

事实证明 llvm 巧妙地将 popcount 函数转换为本机 CPU POPCNT 指令,因此我们应该期望它很快。但问题是,有多快。

我想将它与 C 版本进行比较,看看速度差异。

#include <stdio.h> #include <time.h> // Function to calculate the population count (number of set bits) of an integer using __builtin_popcount int popcount(int num) { return __builtin_popcount(num); } int main() { unsigned int n; printf("Enter the value of n: "); scanf("%d", &n); // Variables to store start and end times struct timespec start_time, end_time; // Get the current time as the start time clock_gettime(CLOCK_MONOTONIC, &start_time); int sum = 0; for (unsigned int i = 0; i < n; i++) { sum += popcount(i); } // Get the current time as the end time clock_gettime(CLOCK_MONOTONIC, &end_time); // Calculate the elapsed time in microseconds long long elapsed_time = (end_time.tv_sec - start_time.tv_sec) * 1000000LL + (end_time.tv_nsec - start_time.tv_nsec) / 1000; printf("Sum of population counts from 0 to %d-1 is: %d\n", n, sum); printf("Elapsed time: %lld microseconds\n", elapsed_time); return 0; }
然后我用 

-march=native -Ofast

 编译了这个。我尝试了 gcc 和 clang,结果非常相似。

./popcount Enter the value of n: 1000000 Sum of population counts from 0 to 1000000-1 is: 9884992 Elapsed time: 732 microseconds
为什么 numba 的速度是 C 代码的两倍?

python c performance numba
1个回答
0
投票
我使用

nanobench 基准测试库对你的代码进行了基准测试:

#define ANKERL_NANOBENCH_IMPLEMENT #include <nanobench.h> // Function to calculate the population count (number of set bits) of an integer using __builtin_popcount int popcount(int num) { return __builtin_popcount(num); } int main() { unsigned int n = 1000000; int sum = 0; ankerl::nanobench::Bench().minEpochIterations(100).run("popcount bench", [&] { for (unsigned int i = 0; i < n; i++) { sum += popcount(i); } ankerl::nanobench::doNotOptimizeAway(sum); }); return 0; }
编译它

g++ -Ofast -march=native -I./nanobench/src/include/ ./main.cpp
这给了我这个输出:

| ns/op | op/s | err% | total | benchmark |--------------------:|--------------------:|--------:|----------:|:---------- | 216,476.86 | 4,619.43 | 0.1% | 0.26 | `popcount bench`
因此,如果我正确解释结果,在一秒钟内我们会执行 4619 次操作(一次迭代约 216us)


相比之下,使用

numba

 + 
timeit
:

import numba as nb from numba import njit @nb.njit(nb.uint64(nb.uint64)) def popcount(x): b = 0 while x > 0: x &= x - nb.uint64(1) b += 1 return b @njit def timed_loop(n): summand = 0 for i in range(n): summand += popcount(i) return summand # warm the cache timed_loop(1) from timeit import timeit t = timeit("timed_loop(n)", setup="n = 1000000", globals=globals(), number=1000) print(t)
打印:

0.13498791516758502
因此 1000 次迭代需要 0.135 秒,1 次迭代大约需要 135 秒。

所以是的,您的观察是正确的。看来 numba 快了 2 倍。


我的规格:Python 3.11/Ubuntu 20.04/AMD 5700x/g++ 9.4.0

© www.soinside.com 2019 - 2024. All rights reserved.