我有这个简单的 python/numba 代码:
from numba import njit
import numba as nb
@nb.njit(nb.uint64(nb.uint64))
def popcount(x):
b=0
while(x > 0):
x &= x - nb.uint64(1)
b+=1
return b
@njit
def timed_loop(n):
summand = 0
for i in range(n):
summand += popcount(i)
return summand
它只是将整数 0 到 n - 1 的 popcount 相加。
当我计时时,我得到:
%timeit timed_loop(1000000)
340 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
事实证明 llvm 巧妙地将 popcount 函数转换为本机 CPU POPCNT 指令,因此我们应该期望它很快。但问题是,有多快。
我想将它与 C 版本进行比较,看看速度差异。
#include <stdio.h>
#include <time.h>
// Function to calculate the population count (number of set bits) of an integer using __builtin_popcount
int popcount(int num) {
return __builtin_popcount(num);
}
int main() {
unsigned int n;
printf("Enter the value of n: ");
scanf("%d", &n);
// Variables to store start and end times
struct timespec start_time, end_time;
// Get the current time as the start time
clock_gettime(CLOCK_MONOTONIC, &start_time);
int sum = 0;
for (unsigned int i = 0; i < n; i++) {
sum += popcount(i);
}
// Get the current time as the end time
clock_gettime(CLOCK_MONOTONIC, &end_time);
// Calculate the elapsed time in microseconds
long long elapsed_time = (end_time.tv_sec - start_time.tv_sec) * 1000000LL +
(end_time.tv_nsec - start_time.tv_nsec) / 1000;
printf("Sum of population counts from 0 to %d-1 is: %d\n", n, sum);
printf("Elapsed time: %lld microseconds\n", elapsed_time);
return 0;
}
然后我用 -march=native -Ofast
编译了这个。我尝试了 gcc 和 clang,结果非常相似。
./popcount
Enter the value of n: 1000000
Sum of population counts from 0 to 1000000-1 is: 9884992
Elapsed time: 732 microseconds
为什么 numba 的速度是 C 代码的两倍?
nanobench 基准测试库对你的代码进行了基准测试:
#define ANKERL_NANOBENCH_IMPLEMENT
#include <nanobench.h>
// Function to calculate the population count (number of set bits) of an integer using __builtin_popcount
int popcount(int num) {
return __builtin_popcount(num);
}
int main() {
unsigned int n = 1000000;
int sum = 0;
ankerl::nanobench::Bench().minEpochIterations(100).run("popcount bench", [&] {
for (unsigned int i = 0; i < n; i++) {
sum += popcount(i);
}
ankerl::nanobench::doNotOptimizeAway(sum);
});
return 0;
}
编译它
g++ -Ofast -march=native -I./nanobench/src/include/ ./main.cpp
这给了我这个输出:
| ns/op | op/s | err% | total | benchmark
|--------------------:|--------------------:|--------:|----------:|:----------
| 216,476.86 | 4,619.43 | 0.1% | 0.26 | `popcount bench`
因此,如果我正确解释结果,在一秒钟内我们会执行 4619 次操作(一次迭代约 216us)
numba
+
timeit
:
import numba as nb
from numba import njit
@nb.njit(nb.uint64(nb.uint64))
def popcount(x):
b = 0
while x > 0:
x &= x - nb.uint64(1)
b += 1
return b
@njit
def timed_loop(n):
summand = 0
for i in range(n):
summand += popcount(i)
return summand
# warm the cache
timed_loop(1)
from timeit import timeit
t = timeit("timed_loop(n)", setup="n = 1000000", globals=globals(), number=1000)
print(t)
打印:
0.13498791516758502
因此 1000 次迭代需要 0.135 秒,1 次迭代大约需要 135 秒。所以是的,您的观察是正确的。看来 numba 快了 2 倍。