I'm trying to figure out whether two Nvidia 2070S GPUs on the same Ubuntu 20.04 system can reach each other through NCCL and PyTorch 1.8.
My test script is based on the PyTorch documentation, with the backend changed from "gloo" to "nccl".
With the "gloo" backend, the script finishes in well under a minute:
$ time python test_ddp.py
Running basic DDP example on rank 0.
Running basic DDP example on rank 1.
real 0m4.839s
user 0m4.980s
sys 0m1.942s
However, when the backend is set to "nccl", the script gets stuck after the output below and never returns to the bash prompt:
$ python test_ddp.py
Running basic DDP example on rank 1.
Running basic DDP example on rank 0.
The same problem occurs with InfiniBand disabled:
$ NCCL_IB_DISABLE=1 python test_ddp.py
Running basic DDP example on rank 1.
Running basic DDP example on rank 0.
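When NCCL hangs like this, it can help to see how far initialization gets before asking for a fix. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables (not specific to this script); they make each rank log which network and transport it selects:

```shell
# Print NCCL's initialization and transport-selection log for each rank.
# The log usually shows whether the ranks ever find each other, and which
# interface (e.g. lo, eth0) and transport (P2P, SHM, NET) NCCL picked.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET python test_ddp.py
```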
The packages I'm using:
How can we fix the problem that occurs when using NCCL? Thanks!
The Python code used to test NCCL:
import os
import sys
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def setup(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"

    # gloo: works
    # dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # nccl: hangs forever
    dist.init_process_group(
        "nccl", init_method="tcp://10.1.1.20:23456", rank=rank, world_size=world_size
    )


def cleanup():
    dist.destroy_process_group()


class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn, args=(world_size,), nprocs=world_size, join=True)


if __name__ == "__main__":
    run_demo(demo_basic, 2)
I suggest launching the script with PyTorch's torchrun to let the GPUs communicate over NCCL. The same page you linked to also shows how to use torchrun. mp.spawn() then becomes unnecessary, because torchrun spawns the (fault-tolerant) worker processes itself.
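A minimal sketch of how the script above might look when adapted for torchrun (not your exact code, but the same ToyModel): torchrun exports RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT into each worker's environment, so the hard-coded tcp:// init_method and the setup()/mp.spawn() machinery go away and init_process_group falls back to the default env:// rendezvous.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def main():
    # Under torchrun, init_method defaults to env:// and reads the
    # environment variables torchrun already exported; no address needed.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = ToyModel().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10).to(local_rank))
    labels = torch.randn(20, 5).to(local_rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()


# torchrun sets LOCAL_RANK; when the file is run without torchrun
# (e.g. imported for inspection), skip the distributed entry point.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Launched on a single node with both GPUs as, for example: torchrun --standalone --nproc_per_node=2 test_ddp.py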