CuPy indexing in a 2D CUDA grid kernel?

Problem description (votes: 0, answers: 2)

I am trying to get started with CuPy for some CUDA programming. I need to write my own kernels, but I am struggling with 2D kernels; CuPy does not seem to behave the way I expect. Here is a very simple example of a 2D kernel in Numba CUDA:

import cupy as cp
from numba import cuda

@cuda.jit
def nb_add_arrs(x1, x2, y):
  i, j = cuda.grid(2)
  if i < y.shape[0] and j < y.shape[1]:
    y[i, j] = x1[i, j] + x2[i, j]

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[1] + 1)
# Call kernel
nb_add_arrs[bpg, tpb](x1, x2, y)

The result is as expected:

y
[[2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]
 [2 2 2 2 2]]

However, when I try to write this simple kernel in CuPy, I do not get the same result.

cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(const float* x1, const float* x2, float* y, int N){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  int j = blockDim.y * blockIdx.y + threadIdx.y;

  if(i < N && j < N){
    y[i, j] = x1[i, j] + x2[i, j];
  }
}
''', 'add_arrs')

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[1] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))

y
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]

Can someone help me figure out why?

python cuda cupy
2 Answers
1 vote

Memory in C is stored in row-major order, so we need to build a flat index in that order; note that y[i, j] is not 2D indexing in C, since the comma operator makes it equivalent to y[j]. Also, because I am passing int arrays, I changed the kernel's parameter types. Here is the code:

cp_add_arrs = cp.RawKernel(r'''
extern "C" __global__
void add_arrs(int* x1, int* x2, int* y, int N){
  int i = blockDim.x * blockIdx.x + threadIdx.x;
  int j = blockDim.y * blockIdx.y + threadIdx.y;
  
  if(i < N && j < N){
    y[j + i*N] = x1[j + i*N] + x2[j + i*N];
  }
}
''', 'add_arrs')

x1 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
x2 = cp.ones(25, dtype=cp.int32).reshape(5, 5)
y = cp.zeros((5, 5), dtype=cp.int32)
N = x1.shape[0]
# Grid and block sizes
tpb = (16, 16)
bpg = (x1.shape[0] // tpb[0] + 1, x1.shape[1] // tpb[1] + 1)
# Call kernel
cp_add_arrs(bpg, tpb, (x1, x2, y, cp.int32(N)))

y
array([[2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2],
       [2, 2, 2, 2, 2]], dtype=int32)
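
As a side note, the kernel above assumes a square N x N array. Below is a minimal sketch of a rectangular variant that passes the row and column counts separately; the add_arrs_2d name, the H/W parameters, and the ceil-division grid sizing are illustrative additions, not part of the original answer:

import cupy as cp

cp_add_arrs_2d = cp.RawKernel(r'''
extern "C" __global__
void add_arrs_2d(const int* x1, const int* x2, int* y, int H, int W){
  int i = blockDim.x * blockIdx.x + threadIdx.x;  // row index
  int j = blockDim.y * blockIdx.y + threadIdx.y;  // column index

  if(i < H && j < W){
    y[j + i*W] = x1[j + i*W] + x2[j + i*W];  // row-major flat index
  }
}
''', 'add_arrs_2d')

x1 = cp.ones(15, dtype=cp.int32).reshape(3, 5)
x2 = cp.ones(15, dtype=cp.int32).reshape(3, 5)
y = cp.zeros((3, 5), dtype=cp.int32)
H, W = y.shape
tpb = (16, 16)
# Ceiling division so the grid covers the whole array without an extra empty block
bpg = ((H + tpb[0] - 1) // tpb[0], (W + tpb[1] - 1) // tpb[1])
cp_add_arrs_2d(bpg, tpb, (x1, x2, y, cp.int32(H), cp.int32(W)))

With this layout y should again come out as all 2s, and the same kernel works for any shape as long as the flat index uses the actual row width W.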

0 votes

What can be confusing is that there are only numbers of blocks and numbers of threads.

Grid dimensions (bx, by, bz) --> blocks[bx, by, bz]

Block dimensions (tx, ty, tz) --> threads[tx, ty, tz]

A single thread is, for example, block[a, b, c] thread[d, e, f].

This can all be mapped onto a single counter K, as the example shows. Play around with it if you like:

matmul = cp.RawKernel(r'''
extern "C" __global__
void matmul_kernel(float *A, float *B, float *C) {
    int K = threadIdx.x
      +  blockDim.x  * threadIdx.y
      +  blockDim.x  *  blockDim.y  * threadIdx.z
      +  blockDim.x  *  blockDim.y  *  blockDim.z  * blockIdx.x
      +  blockDim.x  *  blockDim.y  *  blockDim.z  *  gridDim.x  * blockIdx.y
      +  blockDim.x  *  blockDim.y  *  blockDim.z  *  gridDim.x  *  gridDim.y  * blockIdx.z ;

    // blockDim and gridDim each have x, y and z components; the formula
    // above folds all six indices into a single linear counter K.

      if (K<10000) { 
        if (K%9==0) {C[K]=gridDim.x;}
        if (K%9==1) {C[K]=gridDim.y;}
        if (K%9==2) {C[K]=gridDim.z;}
        if (K%9==3) {C[K]=blockDim.x;}
        if (K%9==4) {C[K]=blockDim.y;}
        if (K%9==5) {C[K]=blockDim.z;}
        if (K%9==6) {C[K]=threadIdx.x;}
        if (K%9==7) {C[K]=threadIdx.y;}
        if (K%9==8) {C[K]=threadIdx.z;}
      C[K]=K;  // note: this final assignment overwrites the values above; comment it out to inspect gridDim/blockDim/threadIdx instead
      }

}
''', 'matmul_kernel')
x1 = cp.random.random(10000, dtype=cp.float32)
x2 = cp.random.random(10000, dtype=cp.float32)
y = cp.zeros((10000), dtype=cp.float32)
matmul((32, 32, 32), (10, 10, 10), (x1, x2, y))  # grid, block, and kernel arguments; the grid can be large, but a block is limited to 1024 threads in total (here 10*10*10 = 1000)
y.max()
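
For a quick sanity check on this mapping (my own addition, reusing y from the snippet above): the launch creates 32*32*32 blocks of 10*10*10 threads, i.e. 32,768,000 threads in total, and each K in [0, 10000) is produced by exactly one of them, so after the kernel runs y should simply hold its own index.

# Assumes the matmul kernel above has just been launched and y is still in scope
total_threads = (32 * 32 * 32) * (10 * 10 * 10)
print(total_threads)                                           # 32768000
print(bool((y == cp.arange(10000, dtype=cp.float32)).all()))   # expected: True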