CUBLAS transposed matrix multiplication problem


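I am trying to compute C = A^T * B with CUBLAS. The problem is that, with the code I have (taken from this), some matrix dimensions seem to work fine, for example int rows_a = 1, cols_a = 200, rows_b = 1, cols_b = 200. With other dimensions the values come out wrong, for example int rows_a = 200, cols_a = 5, rows_b = 200, cols_b = 5;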

In my code, I set up the two matrices and multiply them with the CUBLAS function cublasSgemm. I then perform the same matrix multiplication on the CPU to check the result.

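#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// printMatriz(matrix, rows, cols) is the asker's print helper; its definition was not included in the post.

int main(int argc, char *argv[])
{
    cublasHandle_t handle;
    cublasCreate(&handle);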

    int rows_a = 200, cols_a = 5, rows_b = 200, cols_b = 5;

    float al = 1.0f;
    float bet = 0.0f;
    float *a = (float *)malloc(rows_a * cols_a * sizeof(float));
    float *b = (float *)malloc(rows_b * cols_b * sizeof(float));
    float *c = (float *)malloc(cols_a * cols_b * sizeof(float)); // CUBLAS result
    float *cpu= (float *)malloc(cols_a * cols_b * sizeof(float)); // CPU result

    for (int i = 0; i < rows_a * cols_a; i++)
    {
        a[i] = i;
    }

    for (int i = 0; i < rows_b * cols_b; i++)
    {
        b[i] = i*4;
    }

    float *dev_a, *dev_b, *dev_c;
    cudaMalloc((void **)&dev_a, rows_a * cols_a * sizeof(float));
    cudaMalloc((void **)&dev_b, rows_b * cols_b * sizeof(float));
    cudaMalloc((void **)&dev_c, cols_a * cols_b * sizeof(float));

    cudaMemcpy(dev_a, a, rows_a * cols_a * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, rows_b * cols_b * sizeof(float), cudaMemcpyHostToDevice);

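    // Row-major C = A^T * B is computed as column-major C^T = B^T * A (cols_b x cols_a), so ldc is cols_b.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, cols_b, cols_a, rows_b, &al, dev_b, cols_b, dev_a, cols_a, &bet, dev_c, cols_b);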

    cudaMemcpy(c, dev_c, cols_a * cols_b * sizeof(float), cudaMemcpyDeviceToHost);
    printMatriz(c, cols_a, cols_b);

    //CPU
    for (int i = 0; i < cols_a; i++)
    {
        for (int j = 0; j < cols_b; j++)
        {
            float v = 0;
            for (int k = 0; k < rows_a; k++)
            {
                v += a[(cols_a * k) + i] * b[(cols_b * k) + j];
            }
            cpu[(i * cols_b) + j] = v;
        }
    }

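    printMatriz(cpu, cols_a, cols_b);

    // Release device memory, host memory, and the CUBLAS handle.
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    free(a);
    free(b);
    free(c);
    free(cpu);
    cublasDestroy(handle);
    return 0;
}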

Incorrect output:

(cublas)
264670000.000000 265068000.000000 265466000.000000 265864000.000000 266262000.000000 
265068000.000000 265466800.000000 265865600.000000 266264400.000000 266663200.000000
...

(cpu)
264669856.000000 265068016.000000 265466144.000000 265864000.000000 266261856.000000 
265068016.000000 265466656.000000 265865584.000000 266264544.000000 266663184.000000 
...

I expect the two results to be identical, so evidently my implementation is wrong. Can anyone help me? Thanks!

cuda transpose multiplication cublas
1 Answer

I think you are just hitting the limits of floating-point precision; those values are only a few bits apart. For example, in "hex notation":

265068000 is 0x1.f993bcp+27
265068016 is 0x1.f993bep+27

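Note that the last hex digit differs by only 2 (0xf993be - 0xf993bc), which is a single unit in the last place of a 32-bit float at this magnitude (a difference of 16). Given that each result is the sum of 200 products, with a rounding at every step, being off by that little is fine.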

Also note that a 32-bit float holds about 7 significant decimal digits, while a 64-bit double holds about 15.
