"cudaMalloc" unintentionally allocates memory on multiple GPUs instead of just one

Question

I am running into a strange problem on a system that uses CUDA. Initially I thought the problem was in PyTorch, but it still occurs with this custom CUDA C code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <GPU_ID> <Memory_Size_GB>\n", argv[0]);
        return 1;
    }

    int gpu_id = atoi(argv[1]);
    size_t memory_size_gb = atoll(argv[2]);

    
    // Set the GPU
    cudaError_t cudaStatus = cudaSetDevice(gpu_id);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed (%s)! Do you have a CUDA-capable GPU installed?\n", cudaGetErrorString(cudaStatus));
        return 1;
    }

    // Convert GB to bytes for memory allocation
    size_t size = memory_size_gb * 1024 * 1024 * 1024;

    // Allocate memory on the GPU
    void *gpu_memory;
    cudaStatus = cudaMalloc(&gpu_memory, size);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed to allocate %zu bytes: %s\n", size, cudaGetErrorString(cudaStatus));
        return 1;
    }

    printf("Allocated %zu GB of memory on GPU %d\n", memory_size_gb, gpu_id);
    printf("Press any key to free memory and exit...\n");

    getchar(); // Wait for key press

    // Free the memory when done
    cudaStatus = cudaFree(gpu_memory);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(cudaStatus));
        return 1;
    }

    return 0;
}
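
A side note on the argument handling above: atoi and atoll report no errors, so a mistyped GPU ID or size silently becomes 0, and the GB-to-bytes multiplication is not checked for overflow. The following is a minimal sketch of stricter parsing, useful only to rule out bad arguments as a factor; it does not touch the multi-GPU behavior itself. The helper names parse_ull, gib_to_bytes, and parse_gpu_id are hypothetical and not part of the original program.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a non-negative decimal integer.
 * Returns 0 on success, -1 on empty, signed, malformed, or out-of-range input. */
int parse_ull(const char *text, unsigned long long *out) {
    char *end = NULL;
    /* Require a leading digit so that whitespace, '+' and '-' are rejected. */
    if (text == NULL || !isdigit((unsigned char)text[0])) {
        return -1;
    }
    errno = 0;
    unsigned long long value = strtoull(text, &end, 10);
    if (errno == ERANGE || *end != '\0') {
        return -1;  /* overflow or trailing garbage such as "5x" */
    }
    *out = value;
    return 0;
}

/* Convert a size in GiB to bytes, refusing values that would overflow size_t. */
int gib_to_bytes(unsigned long long gib, size_t *out) {
    const size_t bytes_per_gib = (size_t)1 << 30;
    if (gib > SIZE_MAX / bytes_per_gib) {
        return -1;
    }
    *out = (size_t)gib * bytes_per_gib;
    return 0;
}

/* Parse a GPU ID that must fit in an int (the type cudaSetDevice expects). */
int parse_gpu_id(const char *text, int *out) {
    unsigned long long value;
    if (parse_ull(text, &value) != 0 || value > INT_MAX) {
        return -1;
    }
    *out = (int)value;
    return 0;
}
```

In main, parse_gpu_id(argv[1], &gpu_id) and parse_ull(argv[2], ...) followed by gib_to_bytes would replace the atoi and atoll calls, with the usage message printed whenever one of them returns -1.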

When I run the program, it allocates memory on both of the system's GPUs, not just on GPU#0.

$ ./gtest 0 5
Allocated 5 GB of memory on GPU 0
Press any key to free memory and exit...

# Faulty system
$ nvidia-smi    
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:01:00.0 Off |                  Off |
| 35%   68C    P0    72W / 250W |   5282MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P6000        Off  | 00000000:02:00.0 Off |                  Off |
| 29%   63C    P0    63W / 250W |   5282MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

The problem persists even if I do not call cudaSetDevice(gpu_id).

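I do not see the same behavior on another, similar system running the same program. Note that on this system memory is allocated only on GPU#0 and not on the other GPU.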

# Working System
$ nvidia-smi  
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 22%   32C    P2    67W / 250W |   5262MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
| 22%   28C    P8    14W / 250W |      5MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Additional information about the faulty system

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-7             N/A
GPU1    PHB      X      0-7             N/A
...

Answer 1

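Sorry, this is not an answer, but I do have root access to the machine and I am hoping to find a solution, and I do not yet have the reputation to comment.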

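I did some additional digging. We have three machines with the same software setup but different GPUs installed. The same code posted by Emoticon works as advertised on the other two machines, but not on this one.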

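In the Xorg.0.log file on the machine where the program misbehaves, I noticed a complaint that the Auto setting is not supported. However, changing SLI to mosaic appeared to have no effect. It also turned out that the same error appears on one of the machines that does not have this problem.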

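I also found some strange behavior when using the CUDA_VISIBLE_DEVICES environment variable. On the other two machines, running

CUDA_VISIBLE_DEVICES=0 ./gtest 0 5

and

CUDA_VISIBLE_DEVICES=1 ./gtest 0 5

works as expected (memory is allocated on GPU 0 and GPU 1, respectively). On the problematic machine, however, one setting of the variable produces a segmentation fault, and another results in memory being allocated on both cards.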
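
For context on why both commands above pass 0 as the GPU ID: when CUDA_VISIBLE_DEVICES is set, the CUDA runtime renumbers the listed devices starting from 0 in the order given, so device 0 inside the process refers to the first entry of the list. The sketch below models only that index mapping in plain C. The function name physical_gpu_index is hypothetical, and the code assumes the variable holds plain integer indices (no UUIDs) and does not know how many GPUs are actually installed. Note also that CUDA's default enumeration order can differ from the PCI bus order that nvidia-smi displays unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Map a logical device index (the value passed to cudaSetDevice) to the
 * index written in a CUDA_VISIBLE_DEVICES-style string.
 * Returns -1 if no such device is visible.
 * Assumption: entries are plain non-negative integers separated by commas.
 */
int physical_gpu_index(const char *visible, int logical) {
    if (logical < 0) {
        return -1;
    }
    if (visible == NULL) {
        return logical;  /* variable unset: no remapping takes place */
    }
    const char *cursor = visible;
    for (int position = 0;; ++position) {
        /* An empty or non-numeric entry hides itself and everything after it. */
        if (*cursor < '0' || *cursor > '9') {
            return -1;
        }
        char *end = NULL;
        errno = 0;
        long value = strtol(cursor, &end, 10);
        if (errno == ERANGE || value > INT_MAX) {
            return -1;
        }
        if (*end != ',' && *end != '\0') {
            return -1;  /* trailing garbage inside an entry */
        }
        if (position == logical) {
            return (int)value;
        }
        if (*end == '\0') {
            return -1;  /* list ended before reaching the requested index */
        }
        cursor = end + 1;
    }
}
```

Under these assumptions, physical_gpu_index("1", 0) returns 1, which matches the second command allocating on GPU 1 on the healthy machines, while physical_gpu_index("1", 1) returns -1 because only one device is visible.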
