为什么 numba popcount 代码比同等 C 代码快两倍？

Question

我有这个简单的 python/numba 代码：

from numba import njit
import numba as nb
@nb.njit(nb.uint64(nb.uint64))
def popcount(x): 
      b=0
      while(x > 0):
          x &= x - nb.uint64(1)   
          b+=1
      return b
@njit
def timed_loop(n):
    summand = 0
    for i in range(n):
        summand += popcount(i)
    return summand

它只是将整数 0 到 n - 1 的 popcount 相加。

当我计时时，我得到：

%timeit timed_loop(1000000)
340 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

事实证明 llvm 巧妙地将 popcount 函数转换为本机 CPU POPCNT 指令，因此我们应该期望它很快。但问题是，有多快。

我想将它与 C 版本进行比较，看看速度差异。

 #include <stdio.h>
#include <time.h>

// Function to calculate the population count (number of set bits) of an integer using __builtin_popcount
int popcount(int num) {
    return __builtin_popcount(num);
}

int main() {
    unsigned int n;
    printf("Enter the value of n: ");
    scanf("%d", &n);

    // Variables to store start and end times
    struct timespec start_time, end_time;

    // Get the current time as the start time
    clock_gettime(CLOCK_MONOTONIC, &start_time);

    int sum = 0;
    for (unsigned int i = 0; i < n; i++) {
        sum += popcount(i);
    }

    // Get the current time as the end time
    clock_gettime(CLOCK_MONOTONIC, &end_time);

    // Calculate the elapsed time in microseconds
    long long elapsed_time = (end_time.tv_sec - start_time.tv_sec) * 1000000LL +
                            (end_time.tv_nsec - start_time.tv_nsec) / 1000;

    printf("Sum of population counts from 0 to %d-1 is: %d\n", n, sum);
    printf("Elapsed time: %lld microseconds\n", elapsed_time);

    return 0;
}

然后我用

-march=native -Ofast

 编译了这个。我尝试了 gcc 和 clang，结果非常相似。

./popcount 
Enter the value of n: 1000000
Sum of population counts from 0 to 1000000-1 is: 9884992
Elapsed time: 732 microseconds

为什么 numba 的速度是 C 代码的两倍？

Answer 1

我使用

nanobench 基准测试库对你的代码进行了基准测试：

#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>

// Function to calculate the population count (number of set bits) of an integer using __builtin_popcount
int popcount(int num) {
    return __builtin_popcount(num);
}

int main() {
    unsigned int n = 1000000;
    int sum = 0;

    ankerl::nanobench::Bench().minEpochIterations(100).run("popcount bench", [&] {
        for (unsigned int i = 0; i < n; i++) {
            sum += popcount(i);
        }
        ankerl::nanobench::doNotOptimizeAway(sum);
    });

    return 0;
}

编译它

g++ -Ofast -march=native -I./nanobench/src/include/ ./main.cpp

这给了我这个输出：

|               ns/op |                op/s |    err% |     total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
|          216,476.86 |            4,619.43 |    0.1% |      0.26 | `popcount bench`

因此，如果我正确解释结果，在一秒钟内我们会执行 4619 次操作（一次迭代约 216us）

相比之下，使用

numba

timeit

import numba as nb
from numba import njit


@nb.njit(nb.uint64(nb.uint64))
def popcount(x):
    b = 0
    while x > 0:
        x &= x - nb.uint64(1)
        b += 1
    return b


@njit
def timed_loop(n):
    summand = 0
    for i in range(n):
        summand += popcount(i)
    return summand


# warm the cache
timed_loop(1)

from timeit import timeit

t = timeit("timed_loop(n)", setup="n = 1000000", globals=globals(), number=1000)
print(t)

打印：

0.13498791516758502

因此 1000 次迭代需要 0.135 秒，1 次迭代大约需要 135 秒。

所以是的，您的观察是正确的。看来 numba 快了 2 倍。

我的规格：Python 3.11/Ubuntu 20.04/AMD 5700x/g++ 9.4.0

为什么 numba popcount 代码比同等 C 代码快两倍？

问题描述投票：0回答：1

1个回答

最新问题

为什么 numba popcount 代码比同等 C 代码快两倍？

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1