GPU allocation in Slurm: --gres vs --gpus-per-task, and mpirun vs srun

Question

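There are two ways to allocate GPUs in Slurm: either the general --gres=gpu:N parameter, or specific parameters such as --gpus-per-task=N. There are also two ways to launch MPI tasks in a batch script: either with srun, or with the usual mpirun (when OpenMPI is compiled with Slurm support). I found some surprising differences in behaviour between these methods.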

I am submitting a batch job with sbatch, where the basic script is the following:

#!/bin/bash

#SBATCH --job-name=sim_1        # job name (default is the name of this file)
#SBATCH --output=log.%x.job_%j  # file name for stdout/stderr (%x will be replaced with the job name, %j with the jobid)
#SBATCH --time=1:00:00          # maximum wall time allocated for the job (D-H:MM:SS)
#SBATCH --partition=gpXY        # put the job into the gpu partition
#SBATCH --exclusive             # request exclusive allocation of resources
#SBATCH --mem=20G               # RAM per node
#SBATCH --threads-per-core=1    # do not use hyperthreads (i.e. CPUs = physical cores below)
#SBATCH --cpus-per-task=4       # number of CPUs per process

## nodes allocation
#SBATCH --nodes=2               # number of nodes
#SBATCH --ntasks-per-node=2     # MPI processes per node

## GPU allocation - variant A
#SBATCH --gres=gpu:2            # number of GPUs per node (gres=gpu:N)

## GPU allocation - variant B
## #SBATCH --gpus-per-task=1       # number of GPUs per process
## #SBATCH --gpu-bind=single:1     # bind each process to its own GPU (single:<tasks_per_gpu>)

# start the job in the directory it was submitted from
cd "$SLURM_SUBMIT_DIR"

# program execution - variant 1
mpirun ./sim

# program execution - variant 2
#srun ./sim

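The #SBATCH options in the first block are fairly obvious and uninteresting. The behaviour I am about to describe is observable when the job runs on at least 2 nodes. I run 2 tasks per node because each node has 2 GPUs. Finally, there are two variants of GPU allocation (A and B) and two variants of program execution (1 and 2), hence 4 variants in total: A1, A2, B1, B2.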

Variant A1 (--gres=gpu:2, mpirun)

Variant A2 (--gres=gpu:2, srun)

In both variants A1 and A2, the job executes correctly with optimal performance, and the log contains the following output:

Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 3: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1

Variant B1 (--gpus-per-task=1, mpirun)

The job does not execute correctly: the GPUs are mapped incorrectly because CUDA_VISIBLE_DEVICES=0 on the second node:

Rank 0: rank on node is 0, using GPU id 0 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 1: rank on node is 1, using GPU id 1 of 2, CUDA_VISIBLE_DEVICES=0,1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0

Note that this variant behaves the same with and without --gpu-bind=single:1.
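
The log lines above are consistent with each process choosing a device index from its node-local rank and the number of GPUs it can see. The sketch below is an assumption about that logic, not the actual code of sim, and pick_gpu is a hypothetical helper. It shows why both ranks on the second node end up on GPU id 0 once only one device is visible.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the sim program): choose a GPU index
# from the node-local rank and a CUDA_VISIBLE_DEVICES-style list.
pick_gpu() {
    local local_rank="$1" visible="$2"
    local -a devices
    IFS=',' read -r -a devices <<< "$visible"
    local count="${#devices[@]}"
    if (( count == 0 )); then
        echo "no GPU visible" >&2
        return 1
    fi
    # Round-robin over the visible devices.
    echo "$(( local_rank % count )) of ${count}"
}

pick_gpu 1 "0,1"   # local rank 1 on the first node in variant B1
pick_gpu 1 "0"     # local rank 1 on the second node in variant B1
```

With the values from variant B1 this prints "1 of 2" for the first node and "0 of 1" for the second, matching the "using GPU id ... of ..." part of the log.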

Variant B2 (--gpus-per-task=1, --gpu-bind=single:1, srun)

The GPUs are mapped correctly (each process now sees only one GPU because of --gpu-bind=single:1):

Rank 0: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 1: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1
Rank 2: rank on node is 0, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=0
Rank 3: rank on node is 1, using GPU id 0 of 1, CUDA_VISIBLE_DEVICES=1

However, an MPI error appears when the ranks start to communicate (a similar message is repeated once for each rank):

--------------------------------------------------------------------------
The call to cuIpcOpenMemHandle failed. This is an unrecoverable error
and will cause the program to abort.
  Hostname:                         gp11
  cuIpcOpenMemHandle return value:  217
  address:                          0x7f40ee000000
Check the cuda.h file for what the return value means. A possible cause
for this is not enough free device memory.  Try to reduce the device
memory footprint of your application.
--------------------------------------------------------------------------

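Although it says "This is an unrecoverable error", the execution seems to proceed just fine, except that the log is littered with messages like these (presumably one message per MPI communication call):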

[gp11:122211] Failed to register remote memory, rc=-1
[gp11:122212] Failed to register remote memory, rc=-1
[gp12:62725] Failed to register remote memory, rc=-1
[gp12:62724] Failed to register remote memory, rc=-1

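Clearly this is an OpenMPI error message. I found an old thread about this error, which suggested using --mca btl_smcuda_use_cuda_ipc 0 to disable CUDA IPC. However, since srun is used to launch the program in this case, I do not know how to pass such parameters to OpenMPI.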

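Note that in this variant --gpu-bind=single:1 affects only the visible GPUs (CUDA_VISIBLE_DEVICES). Even without this option, each task is still able to select the correct GPU, and the errors still appear.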
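
On passing MCA parameters without mpirun: Open MPI also reads them from environment variables named OMPI_MCA_<parameter>, so the setting from the old thread can be exported in the batch script before the srun line. This is a minimal sketch that assumes srun propagates the submitting environment to the tasks (its default behaviour). Whether disabling CUDA IPC actually removes the errors on this cluster is not verified here.

```shell
#!/bin/bash
# Environment-variable form of "--mca btl_smcuda_use_cuda_ipc 0".
export OMPI_MCA_btl_smcuda_use_cuda_ipc=0

# In the batch script, the launch line would follow here, e.g.:
#   srun ./sim
echo "OMPI_MCA_btl_smcuda_use_cuda_ipc=${OMPI_MCA_btl_smcuda_use_cuda_ipc}"
```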

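Any idea what is going on and how to address the errors in variants B1 and B2? Ideally we would like to use --gpus-per-task, which is more flexible than --gres=gpu:... (there is one parameter fewer to change when we change --ntasks-per-node). Using mpirun vs srun does not matter to us.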

We have Slurm 20.11.5.1, OpenMPI 4.0.5 (built with --with-cuda and --with-slurm) and CUDA 11.2.2. The operating system is Arch Linux. The network is 10G Ethernet (no InfiniBand or OmniPath). Let me know if I should include more information.

Answer 1

I am running into a related problem. Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:1

results in the processes sharing a single GPU

"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"
"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

which I think is correct.

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1

results in only the last process receiving a GPU

"PROCID=2: No devices found."
"PROCID=3: No devices found."
"PROCID=0: No devices found."
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2)"

Note the different IDs in consecutive runs

"PROCID=2: No devices found."
"PROCID=1: No devices found."
"PROCID=3: No devices found."
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360)"

Running with

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4

results in each process having access to all 4 GPUs

"PROCID=3: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=1: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"
"PROCID=0: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-715daa1d-db6f-9e69-ab48-190158bd5360) GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-02348a17-a825-300c-0336-48e33d0dadb2) GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0) GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-7843f55f-a15b-1d4c-229c-39b5c439bd5e)"

Running with

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
#SBATCH --gpu-bind=single:1

again results in only the last process receiving a GPU

"PROCID=1: No devices found."
"PROCID=0: No devices found."
"PROCID=3: No devices found."
"PROCID=2: GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-fbd9a227-e473-b993-215f-8f39b3574fd0)"

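The commands that produced the listings in this answer are not shown. A report of this shape could be generated by a small script started with srun, along the lines of the sketch below. The name report_rank is hypothetical, and listing devices with nvidia-smi -L is an assumption about how the output was obtained.

```shell
#!/bin/bash
# Hypothetical reporter: print the task id and the GPUs it can see.
report_rank() {
    # SLURM_PROCID is set by srun; OMPI_COMM_WORLD_RANK by Open MPI's mpirun.
    local rank="${SLURM_PROCID:-${OMPI_COMM_WORLD_RANK:-unknown}}"
    local gpus=""
    if command -v nvidia-smi >/dev/null 2>&1; then
        # Join the per-GPU lines into a single line.
        gpus="$(nvidia-smi -L 2>&1 | tr '\n' ' ')"
    fi
    echo "PROCID=${rank}: ${gpus:-nvidia-smi not available}"
}

report_rank
```

Saved as a hypothetical report.sh, it would be launched as srun ./report.sh inside the batch script, so that each task prints its own line.
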
Answer 2

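Well, FWIW: variant B1 does not work because mpirun uses srun under the covers only to launch its daemons. There is only one daemon per node, so srun assigns only one GPU to that task (the daemon). The daemon then fork/execs the application processes, which inherit the GPU assignment envar.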

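Variant A1 works only because you asked for two GPUs/task, and you happened to run two application processes on each node. Had you run a third process, or asked for only one GPU/task, it would have failed for the same reason as given above.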
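
The inheritance described above can be mimicked without Slurm or MPI. The toy sketch below is only an illustration under that simplification: simulate_daemon is a hypothetical stand-in for the per-node daemon, and LOCAL_RANK is an illustrative variable, not something Open MPI sets.

```shell
#!/bin/bash
# Toy model only: no Slurm or MPI is involved.
# The "daemon" stands for the single per-node process that srun starts for mpirun.
simulate_daemon() (
    # Pretend the launcher granted this one task a single GPU.
    export CUDA_VISIBLE_DEVICES=0
    for local_rank in 0 1; do
        # Each application process is a child and inherits the environment.
        LOCAL_RANK="$local_rank" bash -c \
            'echo "local rank $LOCAL_RANK sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
    done
)

simulate_daemon
```

Both children report CUDA_VISIBLE_DEVICES=0, which mirrors what ranks 2 and 3 show in the variant B1 log.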
