CUDA multi-device asynchrony with pageable memory

Question

Does CUDA provide a way to achieve asynchrony between different devices while using pageable memory on the host? (Note that this is not the never-ending question about cudaMemcpyAsync blocking on pageable memory with a single device.)

Here is the background: having access to a node with several CUDA devices, I spread the workload evenly across all devices in batches. One batch consists of num_devices host-to-device cudaMemcpyAsync calls (one per device), followed by a kernel launch, and finally a device-to-host cudaMemcpyAsync. These num_devices instances are issued from a for loop over the devices. The problem is the following: since the host memory I use is pageable, cudaMemcpyAsync synchronizes with the host, so each iteration of the innermost for loop only starts once the previous one has fully completed, which prevents the devices from working concurrently.

A minimal example is attached below:

/*
Compilation: nvcc main.cu -o main.cuda
(nvcc version 12)
*/
#include <cuda_runtime.h>
#include <vector>
#include <stdint.h>
#include <cassert>

// trivial kernel for illustration
__global__
void kernel(double* d_u, const uint64_t size)
{
    uint64_t j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j<size) {
        d_u[j] *= 2.0;
    }
}

// providing home-made popcnt in case __builtin_popcount is not supported
unsigned int hm_popcnt(int word) {
    unsigned int n = 0;
    while(word) {
        if (word&1) {++n;}
        word>>=1;
    }
    return n;
}


int main() {

    unsigned int n = 30;
    uint64_t dimension = (1ULL)<<n;

    unsigned int n0 = 27;
    uint64_t batch_size = (1ULL)<<n0;

    int blockSize = 256;
    int numBlocks = (batch_size + blockSize - 1)/blockSize;

    int num_devices;
    cudaGetDeviceCount(&num_devices);
    assert(num_devices > 1); // need more than one device
    assert(__builtin_popcount(num_devices)==1); // for sake of simplicity
    //assert(hm_popcnt(num_devices)==1);

    cudaStream_t streams[num_devices];
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaStreamCreateWithFlags(&streams[dev_id], cudaStreamNonBlocking);
    }

    std::vector<double> h_v(dimension, 1.0); // pageable memory

    // each device holds its array of dimension <batch_size>
    double * d_v[num_devices];
    for (int dev_id=0; dev_id<num_devices; ++dev_id) {
        cudaSetDevice(dev_id);
        cudaMalloc((void**)&d_v[dev_id], batch_size*sizeof(double));
    }

    uint32_t num_batches = ((1UL)<<(n-n0))/num_devices;

    for (uint32_t i=0; i<num_batches; ++i) {
        for (int dev_id=0; dev_id<num_devices; ++dev_id) {
            cudaSetDevice(dev_id);
            uint64_t start_index = (i*num_devices + dev_id) * batch_size;
            cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
            kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
            cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
            // h_v is in pageable memory, so cudaMemcpyAsync is effectively synchronous, preventing the devices from working at the same time
        }
    }

    for (int dev_id=0; dev_id<num_devices; ++dev_id)
    {
        cudaSetDevice(dev_id);
        cudaFree(d_v[dev_id]);
        cudaStreamDestroy(streams[dev_id]);
    }

    return 0;
}

Since dimension is very large in production, performing a host-to-host copy from pageable to pinned memory is not an option. Instead, I can work around the problem by swapping the outer and inner loops and using an OpenMP for loop over the devices (at the cost of involving num_devices CPU threads, which is fine):

// requires to add #include "omp.h" and to compile with flags -Xcompiler -fopenmp
#pragma omp parallel for schedule(static, 1) num_threads(num_devices)
for (int dev_id=0; dev_id<num_devices; ++dev_id) {
    cudaSetDevice(dev_id);
    for (uint32_t i=0; i<num_batches; ++i) {
        uint64_t start_index = (i*num_devices + dev_id) * batch_size;
        cudaMemcpyAsync(d_v[dev_id], &h_v[start_index], batch_size*sizeof(double), cudaMemcpyHostToDevice, streams[dev_id]);
        kernel<<<numBlocks, blockSize, 0, streams[dev_id]>>>(d_v[dev_id], batch_size);
        cudaMemcpyAsync(&h_v[start_index], d_v[dev_id], batch_size*sizeof(double), cudaMemcpyDeviceToHost, streams[dev_id]);
    }
}

IMO this is not very elegant, and I somehow feel that CUDA should offer something cleaner to achieve multi-device asynchrony in this simple scenario. Is that the case? If not, is there another solution?

c++ cuda openmp
1 Answer

When you use pageable rather than pinned memory, cudaMemcpyAsync blocks the calling thread, much like cudaMemcpy. So, in practice, you indeed have to issue cudaMemcpy or cudaMemcpyAsync from different threads to get asynchrony between the devices. However, I would recommend using pinned memory instead. You can simply replace the pageable allocation

std::vector<double> h_v(dimension, 1.0); // pageable memory
with a call to cudaMallocHost().
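A minimal sketch of that substitution, assuming the rest of the question's code stays unchanged (the buffer keeps the name h_v, an explicit loop replaces the vector's fill constructor, and the allocation is released with cudaFreeHost at the end):

double* h_v = nullptr;                                      // pinned (page-locked) host memory
cudaMallocHost((void**)&h_v, dimension * sizeof(double));
for (uint64_t j = 0; j < dimension; ++j) { h_v[j] = 1.0; }  // same initial values as the vector

// ... batched loops exactly as in the question: with pinned memory,
// cudaMemcpyAsync returns immediately, so the copies and kernels on the
// different devices can overlap without the OpenMP workaround ...

cudaFreeHost(h_v); // instead of the vector's automatic destruction

With the host buffer pinned, the original single-threaded double loop over batches and devices already lets the transfers and kernels on different devices run concurrently.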
