When I try to run a sample Python file via torch.distributed.run on 2 nodes of a cluster (2 GPUs per node) using a SLURM script, I get the following error:
[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:16773 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [clara06.url.de]:16773 (errno: 97 - Address family not supported by protocol).
This is the SLURM script:
#!/bin/bash
#SBATCH --job-name=distribution-test # name
#SBATCH --nodes=2 # nodes
#SBATCH --ntasks-per-node=1 # crucial - only 1 task per dist per node!
#SBATCH --cpus-per-task=4 # number of cores per tasks
#SBATCH --partition=clara
#SBATCH --gres=gpu:v100:2 # number of gpus
#SBATCH --time 0:15:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out # output file name
module load Python
pip install --user -r requirements.txt
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
GPUS_PER_NODE=2
LOGLEVEL=INFO python -m torch.distributed.run --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
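The MASTER_PORT line in the script adds the last four characters of the job ID to 10000, so different jobs tend to land on different ports. As a rough illustration of that arithmetic only, the same computation might be written in Python as below. The helper name derive_master_port is hypothetical and not part of SLURM or PyTorch, and the sketch assumes the job ID is purely numeric.

```python
def derive_master_port(job_id: str, base: int = 10000) -> int:
    """Mirror `expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4)` (hypothetical helper)."""
    # `tail -c 4` keeps the last four bytes of the job ID string.
    last_four = job_id[-4:]
    # `expr` reads the digits as a decimal number, so leading zeros are dropped.
    return base + int(last_four)


if __name__ == "__main__":
    # Example with a made-up job ID.
    print(derive_master_port("4821907"))
```

With this scheme the port always falls between 10000 and 19999, so two jobs whose IDs share the same last four digits would still pick the same port.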
And the Python code that should be run:
import fcntl
import os
import socket
import torch
import torch.distributed as dist
def printflock(*msgs):
    """solves multi-process interleaved print problem"""
    with open(__file__, "r") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            print(*msgs)
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)
hostname = socket.gethostname()
gpu = f"[{hostname}-{local_rank}]"
try:
    # test distributed
    dist.init_process_group("nccl")
    dist.all_reduce(torch.ones(1).to(device), op=dist.ReduceOp.SUM)
    dist.barrier()
    # test cuda is available and can allocate memory
    torch.cuda.is_available()
    torch.ones(1).cuda(local_rank)
    # global rank
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    printflock(f"{gpu} is OK (global rank: {rank}/{world_size})")
    dist.barrier()
    if rank == 0:
        printflock(f"pt={torch.__version__}, cuda={torch.version.cuda}, nccl={torch.cuda.nccl.version()}")
except Exception:
    printflock(f"{gpu} is broken")
    raise
I have tried different python invocations, such as:
LOGLEVEL=INFO python -m torch.distributed.run --master_addr $MASTER_ADDR --master_port $MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
LOGLEVEL=INFO torchrun --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
LOGLEVEL=INFO python -m torch.distributed.launch --rdzv_id=$SLURM_JOBID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR\:$MASTER_PORT --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES torch-distributed-gpu-test.py
All of them result in the same error.
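I tried specifying the IP address explicitly instead of MASTER_ADDR:
IP_ADDRESS=$(srun hostname --ip-address | head -n 1)
/etc/resolv.conf: the hostname is clearly mapped.
Appending .ipv4 to MASTER_ADDR to specify the IP version did not work either.

The "Address family not supported by protocol" errors are related to IPv4 versus IPv6. Because my cluster does not provide IPv6 connectivity between the nodes, these errors occur.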
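However, they can be read as warnings: the connection over IPv4 is still established.
I did not find a way to disable the IPv6 connection attempts, but since they are only "info" messages, so to speak, I ignore them.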
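
To check whether a node can open IPv6 sockets at all, a small probe along the following lines may help. This is only a sketch under stated assumptions: the function name ipv6_usable is made up for illustration, and it tests the local host only, not IPv6 connectivity between nodes.

```python
import errno
import socket


def ipv6_usable() -> bool:
    """Return True if this host can create and bind an IPv6 TCP socket (hypothetical helper)."""
    if not socket.has_ipv6:
        # This Python build has no IPv6 support.
        return False
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as sock:
            # Port 0 lets the OS pick any free port on the IPv6 wildcard address.
            sock.bind(("::", 0))
        return True
    except OSError as exc:
        # EAFNOSUPPORT is the "Address family not supported by protocol" error.
        if exc.errno == errno.EAFNOSUPPORT:
            print("IPv6 address family is not supported on this host")
        return False


if __name__ == "__main__":
    print("IPv6 usable:", ipv6_usable())
```

If this prints False on the compute nodes, the [::] warnings from c10d would be consistent with the explanation above, and the IPv4 fallback is the path actually used.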