Why does transposing a CUDA grid (but not its thread blocks) still slow down the computation?

Question · 13 votes · 1 answer

EDIT: It seems that, at least in this case, transposing the grid has a negative impact on L2 cache bandwidth. This was obtained from the Visual Profiler. The reason is still unclear.

I have run into a GPU computing situation in which I need to transpose the CUDA grid. So, if block_{x,y} originally acted on data region d_{x,y}, it now acts on data region d_{y,x}, and likewise block_{y,x} now acts on data region d_{x,y}. The figure below shows an example. (figure: example of a transposed grid)

It is worth mentioning that the threads are not transposed inside each block; that is, once a block has been placed, the threadIdx.x and threadIdx.y values are used in the normal way for its x and y offsets, respectively.

As far as I understand, in theory this design should have no impact on performance, because the memory coalescing pattern is still preserved: the threads inside each block are not transposed, only the grid of blocks is rearranged. However, I found that when the grid is transposed, the kernel runs roughly 2x slower than normal. I made a toy example to illustrate the situation.

➜  transpose-grid ./prog 10000 10000 100 0
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
normal_kernel (100 rep).......done: 0.935132 ms
verifying correctness.........ok
➜  transpose-grid ./prog 10000 10000 100 1
init data.....................done: zero matrix of 10000 x 10000
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(313, 313, 1)
transp_kernel (100 rep).......done: 1.980445 ms
verifying correctness.........ok
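
To make the claim that per-warp coalescing is preserved more concrete, here is a small host-side sketch of my own (separate from the program below; the block position bx, by and the row width n are arbitrary example values) that prints the linear indices written by one warp of a 32x32 block under each kernel's index arithmetic:

 // host-side sketch: indices written by the warp (threadIdx.y = 0, threadIdx.x = 0..31)
 // of block (bx, by), following the index arithmetic of normal_kernel and transp_kernel
 #include <cstdio>

 int main(){
     const int n = 10000;          // row width of the matrix
     const int bx = 5, by = 7;     // an arbitrary block position
     const int BSIZE2D = 32;
     for(int tx = 0; tx < 32; ++tx){
         // normal_kernel:  i = by*32 + ty,  j = bx*32 + tx  ->  dmat[i*n + j]
         const int normal_idx = (by*BSIZE2D + 0)*n + (bx*BSIZE2D + tx);
         // transp_kernel:  i = bx*32 + ty,  j = by*32 + tx  ->  dmat[i*n + j]
         const int transp_idx = (bx*BSIZE2D + 0)*n + (by*BSIZE2D + tx);
         printf("tx=%2d  normal=%9d  transposed=%9d\n", tx, normal_idx, transp_idx);
     }
     // both columns advance by exactly 1 per thread, so each warp writes 32
     // consecutive ints (one 128-byte segment) in either kernel
     return 0;
 }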

I would really appreciate any explanation for this. Below is the source code to reproduce the behavior.

 // -----------------------------------
 // can compile as nvcc main.cu -o prog
 // -----------------------------------

 #include <cuda.h>
 #include <cstdio>

 #define BSIZE2D 32

 __global__ void normal_kernel(int *dmat, const int m, const int n){
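     // blockIdx.y selects the row tile and blockIdx.x the column tile;
     // consecutive threadIdx.x values write consecutive addresses (coalesced)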
     const int i = blockIdx.y*blockDim.y + threadIdx.y;
     const int j = blockIdx.x*blockDim.x + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }

 __global__ void transp_kernel(int *dmat, const int m, const int n){
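     // transposed grid: blockIdx.x now selects the row tile and blockIdx.y the column tile;
     // note this relies on square blocks (blockDim.x == blockDim.y), since threadIdx.y is scaled by blockDim.x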
     const int i = blockIdx.x*blockDim.x + threadIdx.y;
     const int j = blockIdx.y*blockDim.y + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }



 int verify(int *hmat, const int m, const int n){
     printf("verifying correctness........."); fflush(stdout);
     for(int i=0; i<m*n; ++i){
         if(hmat[i] != 1){
             fprintf(stderr, "Incorrect value at m[%i,%i] = %i\n", i/n, i%n, hmat[i]);
             return 0;
         }
     }
     printf("ok\n"); fflush(stdout);
     return 1;
 }
 int main(int argc, char **argv){
     if(argc != 5){
         printf("\nrun as ./prog m n r t\n\nr = number of repeats\nt = transpose (1 or 0)\n");
         exit(EXIT_FAILURE);
     }
     const int m = atoi(argv[1]);
     const int n = atoi(argv[2]);
     const int r = atoi(argv[3]);
     const int t = atoi(argv[4]);
     const unsigned int size = m*n;
     cudaEvent_t start, stop;
     cudaEventCreate(&start);
     cudaEventCreate(&stop);
     float time;
     int *hmat, *dmat;


     printf("init data....................."); fflush(stdout);
     hmat = (int*)malloc(sizeof(int)*(size));
     for(int i=0; i<size; ++i){
         hmat[i] = 0;
     }
     printf("done: zero matrix of %i rows x %i cols\n", m, n);




     printf("copy data to GPU.............."); fflush(stdout);
     cudaMalloc(&dmat, sizeof(int)*(size));
     cudaMemcpy(dmat, hmat, sizeof(int)*(size), cudaMemcpyHostToDevice);
     printf("done\n");



     printf("preparing grid................"); fflush(stdout);
     dim3 block(BSIZE2D, BSIZE2D, 1);
     dim3 grid;
     // if transpose or not
     if(t){
         grid = dim3((m + BSIZE2D - 1)/BSIZE2D, (n + BSIZE2D - 1)/BSIZE2D, 1);
     }
     else{
         grid = dim3((n + BSIZE2D - 1)/BSIZE2D, (m + BSIZE2D - 1)/BSIZE2D, 1);
     }
     printf("done: block(%i, %i, %i), grid(%i, %i, %i)\n", block.x, block.y, block.z, grid.x, grid.y, grid.z);


     if(t){
         printf("transp_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             transp_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }
     else{
         printf("normal_kernel (%3i rep).......", r); fflush(stdout);
         cudaEventRecord(start, 0);
         for(int i=0; i<r; ++i){
             normal_kernel<<<grid, block>>>(dmat, m, n);
             cudaDeviceSynchronize();
         }
         cudaEventRecord(stop,0);
         cudaEventSynchronize(stop);
         cudaEventElapsedTime(&time, start, stop); // that's our time!
         printf("done: %f ms\n", time/(float)r);
     }


     cudaMemcpy(hmat, dmat, sizeof(int)*size, cudaMemcpyDeviceToHost);
     verify(hmat, m, n);
     exit(EXIT_SUCCESS);
 }
Tags: c++ · performance · cuda · block · slowdown
1 Answer
1 vote

Since I could not find any literature on this topic, here is my guess, drawn from experience rather than hard evidence (my old problem with memory reading speed).

As you wrote, your example preserves the memory coalescing pattern, but it does so only at the warp level (32 consecutive threads). To reach full speed, however, coalescing is also needed across warps and neighbouring blocks, and here it is not clear whether such coalescing actually takes place, or whether the caches and memory simply behave better in this scenario (probably, as I described here, because the memory burst mode is utilized better).

So in your normal_kernel execution, not only is a single warp coalesced, but the warps of the next block continue writing into the very same memory region.
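
To put a few numbers on that claim, here is a small host-side sketch of my own (assuming the grid layouts from the question) that prints the first linear index touched by consecutive blocks along blockIdx.x, which is typically the fastest-varying dimension when the grid is enumerated:

 // sketch: first linear index touched by thread (0,0) of blocks (bx, 0), bx = 0..3
 #include <cstdio>

 int main(){
     const int n = 10000, BS = 32;
     for(int bx = 0; bx < 4; ++bx){
         const long normal_first = (long)bx*BS;      // normal grid:     row 0,     column bx*32
         const long transp_first = (long)bx*BS*n;    // transposed grid: row bx*32, column 0
         printf("bx=%d  normal=%8ld  transposed=%8ld\n", bx, normal_first, transp_first);
     }
     // normal: neighbouring blocks extend the same 32 rows (starts 32 ints apart);
     // transposed: every step along blockIdx.x jumps 32*n = 320000 ints ahead
     return 0;
 }

In the normal layout, neighbouring blocks keep filling the same 32 rows, so their writes can be served from the same cache lines and DRAM bursts; in the transposed layout, each step along blockIdx.x jumps 32 full rows ahead.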

To check this on your example, I modified your code to use different block sizes; here are my results on a 1080Ti:

Block size (32, 32), the same as yours:

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(320, 320, 1)
normal_kernel (100 rep).......done: 1.020545 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(32, 32, 1), grid(320, 320, 1)
transp_kernel (100 rep).......done: 1.564084 ms
verifying correctness.........ok

Block size (64, 16); unfortunately we cannot create (64, 64) because of the limit on the number of threads per block:

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(64, 16, 1), grid(160, 640, 1)
normal_kernel (100 rep).......done: 1.020420 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(64, 16, 1), grid(160, 640, 1)
transp_kernel (100 rep).......done: 1.205506 ms
verifying correctness.........ok

Block size (128, 8):

~$ ./prog 10240 10240 100 0
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(128, 8, 1), grid(80, 1280, 1)
normal_kernel (100 rep).......done: 1.019547 ms
verifying correctness.........ok

~$ ./prog 10240 10240 100 1
init data.....................done: zero matrix of 10240 rows x 10240 cols
copy data to GPU..............done
preparing grid................done: block(128, 8, 1), grid(80, 1280, 1)
transp_kernel (100 rep).......done: 1.058236 ms
verifying correctness.........ok
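
For reference, the block shape can be made configurable roughly as sketched below (my own reconstruction, not necessarily the exact modification behind the numbers above; the extra command-line arguments are an assumption). Note that with a non-square block the original transp_kernel no longer covers every row, because threadIdx.y is scaled by blockDim.x, so the transposed index arithmetic has to be adjusted as well:

 // sketch, two fragments to drop into the question's program:

 // (1) transp_kernel adjusted for non-square blocks: pair blockIdx.x with
 //     blockDim.y (rows) and blockIdx.y with blockDim.x (columns)
 __global__ void transp_kernel(int *dmat, const int m, const int n){
     const int i = blockIdx.x*blockDim.y + threadIdx.y;
     const int j = blockIdx.y*blockDim.x + threadIdx.x;
     if(i < m && j < n){
         dmat[i*n + j] = 1;
     }
 }

 // (2) inside main(): block shape from the command line, e.g. ./prog m n r t 64 16
 //     (bsx*bsy must not exceed the 1024 threads-per-block limit)
 const int bsx = atoi(argv[5]);
 const int bsy = atoi(argv[6]);
 dim3 block(bsx, bsy, 1);
 dim3 grid;
 if(t){
     grid = dim3((m + bsy - 1)/bsy, (n + bsx - 1)/bsx, 1); // row tiles in x, column tiles in y
 }
 else{
     grid = dim3((n + bsx - 1)/bsx, (m + bsy - 1)/bsy, 1);
 }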

I am not sure whether this helps with your particular problem, but at least we have some more data to discuss.
