I am following a previously answered question here on how to implement allreduce in CUDA, which links to NVIDIA's slides. Most of what I have works (when the input size is a power of 2), but some input sizes produce an answer that appears to be off by 2.
In the reduction, once there are <= 32 elements left to reduce, the last warp is simply unrolled. The final reduction sums partial_sum[0] + partial_sum[1]. In the example below, I print the values stored just before and after the final partial sum to see where the computation goes wrong. This is the program's output:
threadIdx 0, partial_sum[0] 14710790.000000
threadIdx 1, partial_sum[1] 18872320.000000
threadIdx 0, partial_sum[0] 33583112.000000
threadIdx 1, partial_sum[1] 33059334.000000
expected 33583110, result 33583112
The first two lines are the stored partial sums. The correct result is 14710790 + 18872320 = 33583110, which should then be stored in partial_sum[0] on the third output line. The values stored on the first two lines are correct, but the result of the final addition on the third line is wrong.
Here is a complete working example; compile it with nvcc test.cu. Any guidance would be appreciated!
#include <iostream>
#include <numeric>
#include <vector>
constexpr unsigned int THREADS_PER_DIM = 512;
__global__ void reduce(float *v, float *v_r, std::size_t N, bool print) {
// Allocate shared memory
volatile __shared__ float partial_sum[THREADS_PER_DIM * 4];
// Load elements AND do first add of reduction
// We halve the number of blocks by having each thread reduce two elements:
// one from this block's range and one from what would have been the next block's range
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
// Store first partial result instead of just the elements
// Set to 0 first to be safe
partial_sum[threadIdx.x + blockDim.x] = 0;
partial_sum[threadIdx.x] = 0;
partial_sum[threadIdx.x] = (i < N ? v[i] : 0) + ((i + blockDim.x) < N ? v[i + blockDim.x] : 0);
__syncthreads();
// Start at 1/2 block stride and divide by two each iteration
// Stop early (call device function instead)
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
// Each thread does work unless it is further than the stride
if (threadIdx.x < s) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
}
__syncthreads();
}
if (threadIdx.x < 32) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 32];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 16];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 8];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 4];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 2];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 1];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
}
// Let thread 0 of this block write its result to main memory
// The result is indexed by block
if (threadIdx.x == 0) {
v_r[blockIdx.x] = partial_sum[0];
}
}
constexpr std::size_t N = (1 << 13) + 4;
int main() {
// Init and fill vec
std::vector<float> a(N);
for (std::size_t i = 0; i < N; ++i) {
a[i] = static_cast<float>(i);
}
// allocate and copy on device
std::size_t bytes = N * sizeof(float);
float *d_A;
float *d_A_reduction;
cudaMalloc((void **)&d_A, bytes);
cudaMalloc((void **)&d_A_reduction, bytes);
cudaMemset(d_A_reduction, 0, bytes);
cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);
// 8192 + 4 elements with 512 threads would be 16 + 1 blocks,
// but each block also reads the adjacent block's elements on its first add, so 9 blocks cover everything
auto num_blocks = 9;
reduce<<<num_blocks, THREADS_PER_DIM>>>(d_A, d_A_reduction, N, false);
cudaMemcpy(d_A, d_A_reduction, bytes, cudaMemcpyDeviceToDevice);
reduce<<<1, THREADS_PER_DIM>>>(d_A, d_A_reduction, num_blocks, true);
float result = 0;
cudaMemcpy(&result, d_A_reduction, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_A_reduction);
auto correct_result = std::accumulate(a.begin(), a.end(), 0.0);
std::cout << "expected " << static_cast<int>(correct_result) << ", result " << static_cast<int>(result)
<< std::endl;
}
As pointed out in the comments, there are two problems. First, CUDA 9 changed the guarantees about threads within a warp staying in lockstep, answered previously here: CUDA minimal warp reduction produces race condition.
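For illustration, a minimal sketch of one way to remove the lockstep assumption is to replace the unrolled volatile tail of the kernel with warp shuffle intrinsics (available since CUDA 9), which synchronize the warp explicitly through their mask argument. This is my own sketch, not code from the linked answer:

if (threadIdx.x < 32) {
    // Fold the upper 32 shared-memory slots into a per-thread register
    float val = partial_sum[threadIdx.x] + partial_sum[threadIdx.x + 32];
    // Reduce the 32 per-thread values across the warp; __shfl_down_sync
    // synchronizes the threads named in the mask, so no lockstep
    // execution is assumed
    for (unsigned int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    if (threadIdx.x == 0) {
        v_r[blockIdx.x] = val;
    }
}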
The second problem, which remains after fixing the first, is due to inexact floating-point representation: 32774 + 33550336 cannot be represented as 33583110 in single precision and is instead rounded to 33583112, which matches the off-by-2 result.
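The rounding can be reproduced on the host with a minimal standalone check (my own sketch, not from the original post):

#include <cstdio>

int main() {
    // Floats above 2^25 are spaced 4 apart (24-bit significand), so
    // 33583110 (2 mod 4) is not representable; round-to-nearest-even
    // picks 33583112
    float a = 32774.0f;
    float b = 33550336.0f;
    std::printf("%.1f\n", a + b);       // prints 33583112.0
    std::printf("%.1f\n", 33583110.0f); // prints 33583112.0 as well
    return 0;
}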