I am following a previously answered question here on how to implement allreduce in CUDA, which links to NVIDIA's slides. Most of what I have works (when the input size is a power of 2), but some input sizes produce an answer that appears to be off by 2.
In the reduction, once there are <= 32 elements left to reduce, the last warp is simply unrolled. The final reduction sums partial_sum[0] + partial_sum[1]. In the example below, I print the values stored just before and after the final partial sum to see where the computation goes wrong. This is the program's output:
threadIdx 0, partial_sum[0] 14710790.000000
threadIdx 1, partial_sum[1] 18872320.000000
threadIdx 0, partial_sum[0] 33583112.000000
threadIdx 1, partial_sum[1] 33059334.000000
expected 33583110, result 33583112
The first two lines are the stored partial sums. The correct result is 14710790 + 18872320 = 33583110, which should then be stored in partial_sum[0] on the third output line. The values stored on the first two lines are correct, but the result of the final addition on the third line is wrong.
Here is a complete working example; compile it with nvcc test.cu. Any guidance would be appreciated!
#include <iostream>
#include <numeric>
#include <vector>
constexpr unsigned int THREADS_PER_DIM = 512;
__global__ void reduce(float *v, float *v_r, std::size_t N, bool print) {
// Allocate shared memory
volatile __shared__ float partial_sum[THREADS_PER_DIM * 4];
// Load elements AND do first add of reduction
// We halve the number of blocks by having each thread reduce two elements:
// one from this block's range and one from what would have been the next block's range
unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
// Store first partial result instead of just the elements
// Set to 0 first to be safe
partial_sum[threadIdx.x + blockDim.x] = 0;
partial_sum[threadIdx.x] = 0;
partial_sum[threadIdx.x] = (i < N ? v[i] : 0) + ((i + blockDim.x) < N ? v[i + blockDim.x] : 0);
__syncthreads();
// Start at 1/2 block stride and divide by two each iteration
// Stop early (call device function instead)
for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
// Each thread does work unless it is further than the stride
if (threadIdx.x < s) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
}
__syncthreads();
}
if (threadIdx.x < 32) {
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 32];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 16];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 8];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 4];
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 2];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 1];
if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
}
}
// Let thread 0 of this block write its result to main memory
// The result is indexed by block
if (threadIdx.x == 0) {
v_r[blockIdx.x] = partial_sum[0];
}
}
constexpr std::size_t N = (1 << 13) + 4;
int main() {
// Init and fill vec
std::vector<float> a(N);
for (std::size_t i = 0; i < N; ++i) {
a[i] = static_cast<float>(i);
}
// allocate and copy on device
std::size_t bytes = N * sizeof(float);
float *d_A;
float *d_A_reduction;
cudaMalloc((void **)&d_A, bytes);
cudaMalloc((void **)&d_A_reduction, bytes);
cudaMemset(d_A_reduction, 0, bytes);
cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);
// 8192 + 4 elements with 512 threads would be 16 + 1 blocks,
// but each block also reads the adjacent block's elements on its first add, so 9 blocks cover everything
auto num_blocks = 9;
reduce<<<num_blocks, THREADS_PER_DIM>>>(d_A, d_A_reduction, N, false);
cudaMemcpy(d_A, d_A_reduction, bytes, cudaMemcpyDeviceToDevice);
reduce<<<1, THREADS_PER_DIM>>>(d_A, d_A_reduction, num_blocks, true);
float result = 0;
cudaMemcpy(&result, d_A_reduction, sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_A_reduction);
auto correct_result = std::accumulate(a.begin(), a.end(), 0.0);
std::cout << "expected " << static_cast<int>(correct_result) << ", result " << static_cast<int>(result)
<< std::endl;
}
As pointed out in the comments, there are two problems. First, CUDA 9 changed the guarantees about threads within a warp staying in lockstep, answered previously here: CUDA minimal warp reduction produces race condition.
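For illustration, a minimal sketch of one way to remove the lockstep assumption is to replace the unrolled volatile tail of the kernel with warp shuffle intrinsics (available since CUDA 9), which synchronize the warp explicitly through their mask argument. This is my own sketch, not code from the linked answer:

if (threadIdx.x < 32) {
    // Fold the upper 32 shared-memory slots into a per-thread register
    float val = partial_sum[threadIdx.x] + partial_sum[threadIdx.x + 32];
    // Reduce the 32 per-thread values across the warp; __shfl_down_sync
    // synchronizes the threads named in the mask, so no lockstep
    // execution is assumed
    for (unsigned int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    if (threadIdx.x == 0) {
        v_r[blockIdx.x] = val;
    }
}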
The second problem, which remains after fixing the first, is due to inexact floating-point representation: 32774 + 33550336 cannot be represented as 33583110 in single precision and is instead rounded to 33583112, which matches the off-by-2 result.
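The rounding can be reproduced on the host with a minimal standalone check (my own sketch, not from the original post):

#include <cstdio>

int main() {
    // Floats above 2^25 are spaced 4 apart (24-bit significand), so
    // 33583110 (2 mod 4) is not representable; round-to-nearest-even
    // picks 33583112
    float a = 32774.0f;
    float b = 33550336.0f;
    std::printf("%.1f\n", a + b);       // prints 33583112.0
    std::printf("%.1f\n", 33583110.0f); // prints 33583112.0 as well
    return 0;
}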