Warning: %%timeit does not run the setup code before each loop | CPU kernel launch overhead varies with the number of threads

Question

It was not obvious to me until I hovered over timeit in Google Colab that, in cell mode,

the statement in the first line is used as setup code (executed but not timed) and the body of the cell is timed. The cell body has access to any variables created in the setup code.

I did not find this stated explicitly anywhere, and even though it makes sense in retrospect, I thought I would share it. It is relevant if you need to close a connection or release a resource between runs.
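A minimal sketch of the cell-mode behaviour (my own standard-library example, not from the original post; the import goes in a separate cell because %%timeit must be the first line of its cell):

import random

%%timeit data = [random.random() for _ in range(10**5)]  # setup line: executed once per run, not timed
sorted(data)  # cell body: timed; it can read the `data` variable created by the setup line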

Original question: CPU kernel launch overhead varies with the number of threads

from numba import cuda, float64
import numpy as np
import math

@cuda.jit((float64[:,::1],))
def add_one(A):
  x, y = cuda.grid(2)
  m, n = A.shape
  if x < m and y < n: 
    A[x, y] += 1

n = 10000 # larger for more threads
A = np.zeros((n, n))

# define the number of threads in a block
threads_per_block = (16, 16)
blocks_per_grid_x = math.ceil(A.shape[0] / threads_per_block[0]) 
blocks_per_grid_y = math.ceil(A.shape[1] / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
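
As a quick sanity check of my own (not part of the timings below), the kernel and this grid configuration can be verified with a single launch:

A_gpu = cuda.to_device(A)                             # copy the zero array to the GPU
add_one[blocks_per_grid, threads_per_block](A_gpu)    # launch the kernel once
cuda.synchronize()                                    # wait for it to finish
assert np.all(A_gpu.copy_to_host() == 1.0)            # every element was incremented exactly once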

First, I wanted to check whether to_device() is asynchronous when no stream is supplied, since I will be calling it between runs to reset the array to 0.

%timeit cuda.to_device(A)
141 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit cuda.synchronize() # empty queue (doesn't seem to make a difference)
cuda.to_device(A)
140 ms ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
stream = cuda.stream()
%timeit cuda.to_device(A, stream=stream)
135 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit cuda.to_device(A)
# would take longer if to_device was asynchronous and work was delegated
cuda.synchronize()
10.4 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit cuda.to_device(A, stream=stream)
cuda.synchronize()
9.59 µs ± 240 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I am confused why merely enqueueing the transfer on a stream takes 135 ms. With n=1000, both cases take 1.86 ms.
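
My working assumption (not verified here) is that copies from pageable host memory cannot be fully asynchronous, so to_device still blocks; a truly asynchronous transfer would need page-locked (pinned) host memory, roughly like this sketch:

stream = cuda.stream()
A_pinned = cuda.pinned_array((n, n), dtype=np.float64)   # page-locked host buffer
A_pinned[:] = 0.0
A_gpu = cuda.to_device(A_pinned, stream=stream)          # with pinned memory this can be enqueued asynchronously
cuda.synchronize()                                       # the wait would then happen here instead of inside to_device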

%timeit cuda.synchronize()
11 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Isn't a synchronization latency of 11 µs rather high? Based on

"Queries are basically manually checking a 32-bit memory location used for the polling waits; so in most cases they are very cheap." - source

I would have expected it to be negligible.

# don't time setup code
%%timeit A_gpu = cuda.to_device(A); cuda.synchronize(); add_one[blocks_per_grid, threads_per_block](A_gpu)
# cpu queue start + gpu launch, execution time + synchronize latency
cuda.synchronize() 
11.7 µs ± 2.49 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even with the kernel's execution time included, the result is almost the same as calling synchronize on its own.

# runs for > 10 min for n = 10000 (If I specify -r10 -n100, the result is smaller 41.5 µs ± 11.3 µs), a few hours for n=30000, maybe something I didn't account for is causing timeit to not stop earlier
%%timeit cuda.synchronize(); A_gpu = cuda.to_device(A); cuda.synchronize(); # call again to make sure queue is empty
# cpu kernel launch overhead = adding to queue
add_one[blocks_per_grid, threads_per_block](A_gpu)
%%timeit A_gpu = cuda.to_device(A); cuda.synchronize()
# total time
add_one[blocks_per_grid, threads_per_block](A_gpu)
cuda.synchronize()

Is it normal for the CPU launch overhead to vary with the number of threads (or with the size of the array stored on the GPU)? I thought it should not matter if all it does is add the task to a queue. I am also confused why the CPU launch overhead is higher than the GPU launch overhead (+ execution time + synchronization latency), which involves creating and coordinating the threads.
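
One way I could separate the kernel's execution time from the host-side launch overhead and the synchronize latency is timing with CUDA events (my own sketch, not one of the measurements above):

start = cuda.event(timing=True)
end = cuda.event(timing=True)
A_gpu = cuda.to_device(A)
start.record()                                        # GPU-side timestamp before the launch
add_one[blocks_per_grid, threads_per_block](A_gpu)
end.record()                                          # GPU-side timestamp after the kernel finishes
end.synchronize()                                     # block the host until `end` has been reached
print(cuda.event_elapsed_time(start, end), 'ms')      # elapsed GPU time between the two events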


For matrix multiplication with reasonably large arrays, the execution time is higher than the synchronization latency.

import numpy as np
from numba import cuda, float64
import math

@cuda.jit((float64[:,:], float64[:,:], float64[:,:]))
def matrix_multiplication(A, B, C):
  i, k = cuda.grid(2)
  m, n = A.shape
  _, p = B.shape
  if i < m and k < p:
    C[i, k] = 0
    for j in range(n):
      C[i, k] += A[i, j] * B[j, k]

m = 1500
n = 1000
p = 1000
A = np.random.randn(m, n)
B = np.random.randn(n, p)

A_gpu = cuda.to_device(A)
B_gpu = cuda.to_device(B)
C_gpu = cuda.device_array((m, p))
threads_per_block = (16, 16)
blocks_per_grid = (math.ceil(C_gpu.shape[0]/threads_per_block[0]), math.ceil(C_gpu.shape[1]/threads_per_block[1]))
%%timeit matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
cuda.synchronize()
24.7 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
cuda numba

1 Answer

As a sanity check, I verified whether the setup code (cuda.synchronize()) is actually called before each loop, and not just once per run.

import time
%%timeit -r2 -n5 print('a')
time.sleep(1)
print('b')
time.sleep(1)
a
b
b
b
b
b
a
b
b
b
b
b
2 s ± 7.06 µs per loop (mean ± std. dev. of 2 runs, 5 loops each)

It is only run once per run, so make sure to use 1 loop per run.

%%timeit -r1000 -n1 cuda.synchronize(); A_gpu = cuda.to_device(A); cuda.synchronize()
# cpu kernel launch overhead = adding to queue
add_one[blocks_per_grid, threads_per_block](A_gpu)

My guess is that the first run is slower because the GPU needs to do some setup based on blocks_per_grid and threads_per_block (hence the "slowest run" warning), and that this setup is reused in subsequent runs. The CPU kernel launch overhead increased a little.
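
If that guess is right, a warm-up launch before timing (my own sketch) should keep the one-time cost out of the measurement:

A_gpu = cuda.to_device(A)
add_one[blocks_per_grid, threads_per_block](A_gpu)   # warm-up launch: pays the one-time setup cost
cuda.synchronize()
%timeit -r50 -n1 add_one[blocks_per_grid, threads_per_block](A_gpu); cuda.synchronize()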

%%timeit -r50 -n1 A_gpu = cuda.to_device(A); cuda.synchronize(); add_one[blocks_per_grid, threads_per_block](A_gpu)
# cpu queue start + gpu launch, execution time + synchronize latency
cuda.synchronize() 

Synchronization takes longer than expected. For n = 1000 the latency starts to become negligible.

cuda.to_device does require a synchronize, but most of its overhead is on the CPU.

%%timeit -r100 -n1 cuda.to_device(A) # n = 30000
cuda.synchronize()
87.1 µs ± 12.4 µs per loop (mean ± std. dev. of 100 runs, 1 loop each)
%%timeit -r100 -n1 cuda.synchronize()
cuda.to_device(A)
1.21 s ± 44.5 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)

The matrix multiplication timing is longer now because matrix_multiplication is called once per run and each run has only one loop.

%%timeit -r100 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
cuda.synchronize()
86.7 ms ± 8.84 ms per loop (mean ± std. dev. of 100 runs, 1 loop each)

For this more complex case, the CPU kernel launch overhead is higher.

# m = n = p = 10000
%%timeit -r10 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
cuda.synchronize()
1min 17s ± 50.7 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
%%timeit -r10 -n1 cuda.synchronize()
matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
362 µs ± 80.5 µs per loop (mean ± std. dev. of 10 runs, 1 loop each)

The [blocks_per_grid, threads_per_block] configuration is the same as in the add_one case with n=10000, but there are more arguments.
