Matrix-vector product in CUDA: improving performance with tiling and shared memory

Problem description

Hello, I'm working on a matrix-vector product in a CUDA kernel, and I would like to improve its performance with tiling and shared memory. The problem is that with this code, the matrix M and the vector N are not loaded correctly.

Do you have any idea how to load tiles of M and N into the shared memory arrays?

M is the matrix, N is the vector, and P is the resulting matrix-vector product.
__global__ void matrixMul( float* P, float* M, float* N, int Mw, int Nw)
{
    int bx = blockIdx.x;     int by = blockIdx.y;
    int tx = threadIdx.x;    int ty = threadIdx.y;
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE];

    // ===================================================================
    // Code segment 1
    // Determine the update values for the tile indices in the loop
    // ===================================================================

    int mBegin = Mw * BLOCK_SIZE * by;
    int mEnd   = mBegin + Mw - 1;
    int mStep  = BLOCK_SIZE;
    int nBegin = BLOCK_SIZE * bx;
    //int nStep  = BLOCK_SIZE*Nw;
    int nStep = 1;
    float Psub = 0.0f;

    // ===================================================================
    // Code segment 2
    // Do matrix-matrix multiplication inside a tile
    // ===================================================================

    for (int m = mBegin, n = nBegin; m <= mEnd; m += mStep, n += nStep) {

        // Load a tile from M and N into the shared memory arrays
        Ms[ty][tx] = M[bx*mStep*Mw+m];
        Ns[ty] = N[by*nStep*Nw+n];

        // Synchronize the threads
        __syncthreads();

        // Multiply the two tiles together, each thread accumulating
        // the partial sum of a single dot product.
        for (int i = 0; i < BLOCK_SIZE; i++) {
            Psub += Ms[i][tx] * Ns[i];
        }

        // Synchronize again.
        __syncthreads();
    }

    // ===================================================================
    // Code segment 3
    // Store the data back to global memory
    // ===================================================================

    int p = Nw * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    P[p + nStep] = Psub;
}
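
For context, the question does not show how this kernel is launched, and the tile indexing only makes sense relative to a particular grid shape. A hypothetical host-side launch consistent with the BLOCK_SIZE x BLOCK_SIZE shared tile might look like the following (the helper name launchMatrixMul, the row count Mh, and the device pointers dP, dM, dN are all illustrative guesses, not from the post):

// Hypothetical launch; the original post shows no host code.
// Assumes M is Mh x Mw, BLOCK_SIZE divides Mh and Nw, and dP, dM,
// dN are device pointers allocated and filled elsewhere.
void launchMatrixMul(float* dP, float* dM, float* dN,
                     int Mh, int Mw, int Nw)
{
    dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(Nw / BLOCK_SIZE, Mh / BLOCK_SIZE);
    matrixMul<<<grid, threads>>>(dP, dM, dN, Mw, Nw);
}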
Tags: c++ cuda gpu gpgpu gpu-shared-memory
1 Answer (2 votes)

I found a similar example (note that it multiplies square matrices of the same size) which also loads portions of the matrices into shared memory. Your declarations look correct, so it probably just comes down to the algebra you use to determine which elements go where.

__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width){

    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];  // Shared memory
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];  //   declarations

    int bx = blockIdx.x; int by = blockIdx.y;   // Block and thread
    int tx = threadIdx.x; int ty = threadIdx.y; //   indices

    // Identify the row and column of the Pd element to work on

    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0; // REGISTER!

    // Loop over the Md and Nd tiles required to compute the Pd element
    for (int m = 0; m < Width/TILE_WIDTH; ++m) { 
        // Collaborative loading of Md and Nd tiles into shared memory
        Mds[ty][tx] = Md[Row*Width + (m*TILE_WIDTH + tx)];
        Nds[ty][tx] = Nd[Col + (m*TILE_WIDTH + ty)*Width];

        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue +=  Mds[ty][k] * Nds[k][tx];

        __syncthreads();
    }
    Pd[Row*Width+Col] = Pvalue;
}
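
Adapting that pattern back to the matrix-vector case in the question, a minimal sketch could look like the following. Note that everything here is an assumption rather than something from the original post: the kernel name matVecMul, a row-major M of height Mh and width Mw, a vector N of length Mw with Mw a multiple of BLOCK_SIZE, and a 1D launch with one thread per element of P.

// Minimal sketch of a tiled matrix-vector product under the
// assumptions above; not the original poster's configuration.
// Each thread computes one element of P, and the block cooperatively
// stages one BLOCK_SIZE-wide slice of N in shared memory at a time.
__global__ void matVecMul(float* P, const float* M, const float* N,
                          int Mh, int Mw)
{
    __shared__ float Ns[BLOCK_SIZE];

    int row = blockIdx.x * blockDim.x + threadIdx.x; // row of M, index into P
    float sum = 0.0f;

    // Walk across the columns of M in BLOCK_SIZE-wide tiles.
    for (int t = 0; t < Mw / BLOCK_SIZE; ++t) {
        // Each thread loads one element of the current slice of N.
        Ns[threadIdx.x] = N[t * BLOCK_SIZE + threadIdx.x];
        __syncthreads();

        if (row < Mh) {
            for (int k = 0; k < BLOCK_SIZE; ++k)
                sum += M[row * Mw + t * BLOCK_SIZE + k] * Ns[k];
        }
        __syncthreads();
    }

    if (row < Mh)
        P[row] = sum;
}

Launched as, e.g., matVecMul<<<(Mh + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(dP, dM, dN, Mh, Mw), each slice of N is read from global memory once per block rather than once per thread, which is the point of staging it in shared memory.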