Why does @cuda.jit kernel performance degrade as the number of %timeit runs increases?


When I add the signature manually, the function takes much more time to run on Google Colab.

import numpy as np
from numba import cuda, float64
import math

# Lazily compiled kernel: Numba infers the signature on the first call.
@cuda.jit
def matrix_multiplication(A, B, C):
  i, k = cuda.grid(2)
  m, n = A.shape
  _, p = B.shape
  if i < m and k < p:
    C[i, k] = 0
    for j in range(n):
      C[i, k] += A[i, j] * B[j, k]

# Eagerly compiled kernel: the signature (C-contiguous 2D float64 arrays) is given up front.
@cuda.jit((float64[:,::1], float64[:,::1], float64[:,::1]))
def matrix_multiplication2(A, B, C):
  i, k = cuda.grid(2)
  m, n = A.shape
  _, p = B.shape
  if i < m and k < p:
    C[i, k] = 0
    for j in range(n):
      C[i, k] += A[i, j] * B[j, k]
    
m = 1000
n = 1000
p = 1000
A = np.random.randn(m, n)
B = np.random.randn(n, p)
C = np.empty((m, p))

A_gpu = cuda.to_device(A)
B_gpu = cuda.to_device(B)
C_gpu = cuda.to_device(C)
threads_per_block = (16, 16)
blocks_per_grid = (math.ceil(C.shape[0]/threads_per_block[0]), math.ceil(C.shape[1]/threads_per_block[1]))

print(matrix_multiplication.signatures)
%timeit -r2 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
print(matrix_multiplication.signatures)

print(matrix_multiplication2.signatures)
%timeit -r2 matrix_multiplication2[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
print(matrix_multiplication2.signatures)

%timeit A @ B
[]
190 µs ± 66.7 µs per loop (mean ± std. dev. of 2 runs, 1 loop each)
[(array(float64, 2d, C), array(float64, 2d, C), array(float64, 2d, C))]

[(array(float64, 2d, C), array(float64, 2d, C), array(float64, 2d, C))]
60.9 ms ± 842 µs per loop (mean ± std. dev. of 2 runs, 1000 loops each)
[(array(float64, 2d, C), array(float64, 2d, C), array(float64, 2d, C))]

53.5 ms ± 1.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The cell takes 3 min 19 s to run.

I only used 2 runs (-r2) because the program seemed to hang. That wasn't necessary for numpy, even though timeit also reported a similar time of around 50 ms. (I only just noticed the "1000 loops each"; also, matrix_multiplication used just 1 loop, even though it is faster and has more variance because of compilation time.) To confirm: does a kernel launch "return" right after sending some work to the GPU, with the GPU signalling the CPU when it finishes? (I found other people using %timeit, so I assumed it can be used in place of a profiler and that we don't need to call a synchronization method.)
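
For what it's worth, my understanding is that kernel launches in Numba are asynchronous, so a fair CPU-side timing would need an explicit synchronization around the measured region. This is only a sketch of what I have in mind (reusing the arrays and launch configuration defined above; the loop count of 100 is an arbitrary choice):

from timeit import default_timer as timer

cuda.synchronize()                  # make sure nothing is still queued
start = timer()
for _ in range(100):                # arbitrary number of launches
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
cuda.synchronize()                  # wait for all queued kernels to finish
print((timer() - start) / 100, "s per kernel on average")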

I made sure to add ::1 to specify that the arrays are contiguous. Based on this, we should be able to provide a signature.
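
To double-check that assumption, here is a small sketch using NumPy's flags (the host arrays are freshly allocated, so they should be C-contiguous, and my understanding is that cuda.to_device() keeps that layout on the device):

# float64[:, ::1] in the signature means "2D float64, C-contiguous".
# All three should print True for the eager signature to match.
print(A.flags.c_contiguous, B.flags.c_contiguous, C.flags.c_contiguous)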


Something strange happened when I stopped execution while matrix_multiplication2 was being timed. On the next run, the cursor sat on the function definitions for about 50 s, and matrix_multiplication also became slower: 58.6 ms ± 70.8 µs per loop (mean ± std. dev. of 2 runs, 1000 loops each).

I was able to reproduce this a few times on 2 different accounts. Sometimes, simply running the cell again makes matrix_multiplication slower.

If I specify the number of loops, I get the "slowest run took" warning.

%timeit -r2 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
The slowest run took 818.31 times longer than the fastest. This could mean that an intermediate result is being cached, which makes sense since the function needs to be compiled the first time.
123 ms ± 123 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

I don't know why this didn't happen before, even though 1 loop was used there as well. (I made sure to delete the runtime and run this in a separate cell in case something was being cached.)

%timeit -r2 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
259 µs ± 115 µs per loop (mean ± std. dev. of 2 runs, 1 loop each)

Strangely, I also get the warning for the second function, which is already compiled. I can't get it consistently with fewer than 5 runs.

%timeit -r5 -n1 matrix_multiplication2[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
The slowest run took 4.28 times longer than the fastest. This could mean that an intermediate result is being cached.
98.3 µs ± 69.8 µs per loop (mean ± std. dev. of 5 runs, 1 loop each)

It took 7 min 11 s when nothing was specified.

%timeit matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit matrix_multiplication2[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit A @ B
The slowest run took 4.63 times longer than the fastest. This could mean that an intermediate result is being cached.
104 µs ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
59.9 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1000 loops each)
48.1 ms ± 10.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

I tried this tool to see whether running 1000 loops floods the communication bridge between the CPU and the GPU. I thought CPU usage should be close to 100% since it sits in a wait loop, but I wanted to confirm that it stays slightly below 100%, so the CPU is not completely overwhelmed.
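
To take the host-side launch path out of the measurement entirely, I also considered timing on the GPU itself with CUDA events. This is a sketch based on my reading of the numba.cuda event API (cuda.event() and cuda.event_elapsed_time() are the calls I am assuming here):

# Measure one kernel execution on the device, independent of how long the
# host spends queuing launches.
start_evt = cuda.event(timing=True)
end_evt = cuda.event(timing=True)

start_evt.record()
matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
end_evt.record()
end_evt.synchronize()               # wait until the kernel has finished
print(cuda.event_elapsed_time(start_evt, end_evt), "ms")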

Performance starts to degrade after 500 runs:

%timeit -r1 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu) # compile
%timeit -r10 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit -r100 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit -r500 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit -r1000 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
%timeit -r2000 -n1 matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
1.16 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
141 µs ± 62.2 µs per loop (mean ± std. dev. of 10 runs, 1 loop each)
The slowest run took 15.61 times longer than the fastest. This could mean that an intermediate result is being cached.
123 µs ± 133 µs per loop (mean ± std. dev. of 100 runs, 1 loop each)
The slowest run took 6.99 times longer than the fastest. This could mean that an intermediate result is being cached.
123 µs ± 34 µs per loop (mean ± std. dev. of 500 runs, 1 loop each)
The slowest run took 1741.94 times longer than the fastest. This could mean that an intermediate result is being cached.
32.7 ms ± 27.4 ms per loop (mean ± std. dev. of 1000 runs, 1 loop each)
59.1 ms ± 1.54 ms per loop (mean ± std. dev. of 2000 runs, 1 loop each)
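
If it is the asynchronous launch queue that eventually fills up, then forcing a synchronization after every launch should keep the per-run numbers stable (at the cost of measuring the full kernel time on every run). A sketch of how I would test that assumption:

# Same experiment, but wait for each kernel to finish before launching the
# next one, so the launch queue can never build up.
from timeit import default_timer as timer

times = []
for _ in range(2000):
    start = timer()
    matrix_multiplication[blocks_per_grid, threads_per_block](A_gpu, B_gpu, C_gpu)
    cuda.synchronize()
    times.append(timer() - start)
print(min(times), sum(times) / len(times), max(times))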
Tags: cuda, numba