I am training a network on 2 machines, each with two GPUs. I checked the port number to connect the two machines to each other, but I get an error every time.
How do I find the port number?
sudo lsof -i :22 | grep LISTEN
sshd 2101 root 3u IPv4 57356 0t0 TCP *:ssh (LISTEN)
sshd 2101 root 4u IPv6 57358 0t0 TCP *:ssh (LISTEN)
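One thing worth checking in the output above: lsof prints the columns COMMAND, PID, USER, FD, TYPE, DEVICE, SIZE/OFF, NODE, NAME, so 57356 is the DEVICE value of the sshd socket and not a port. The port in that output is the one shown in the NAME column (ssh, i.e. 22), and it is already held by sshd. The port given in --dist-url needs to be one that is free on the rank 0 machine. A minimal Python sketch for picking or checking such a port, using only the standard library (the helper names find_free_port and port_is_free are hypothetical, not part of PyTorch):

```python
import socket


def find_free_port(host: str = "") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))            # port 0 lets the OS choose a free port
        return s.getsockname()[1]    # the port number the OS assigned


def port_is_free(port: int, host: str = "") -> bool:
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:              # e.g. "Address already in use"
            return False
        return True


if __name__ == "__main__":
    # Run this on the machine that will host rank 0.
    print("candidate port:", find_free_port())
```

The result is only a snapshot: another process could take the port between this check and the start of training, so the training run should be launched soon after.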
Script:
python imagenet_multi_node.py -a resnet50 --dist-url tcp://10.246.246.22:57356 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 -b 128 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
Traceback:
Use GPU: 1 for training
Use GPU: 0 for training
Traceback (most recent call last):
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 511, in <module>
    main()
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 137, in main_worker
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use
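For context, the traceback ends in TCPStore, which the rank 0 process starts as a listening server on the host and port from --dist-url. "Address already in use" therefore suggests that this port could not be bound on the machine, for example because a leftover process from an earlier run still holds it, or because more than one process is acting as rank 0. The official PyTorch ImageNet example, which this script appears to be derived from (an assumption), treats --rank and --world-size as node-level values when --multiprocessing-distributed is set and derives the per-process values from them. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Per-process rank derived from the node rank and the local GPU index."""
    return node_rank * gpus_per_node + local_gpu


def global_world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of processes across all nodes (one process per GPU)."""
    return num_nodes * gpus_per_node


if __name__ == "__main__":
    nodes, gpus = 2, 2  # the setup described in the question
    print("world size:", global_world_size(nodes, gpus))
    for node in range(nodes):
        for gpu in range(gpus):
            print(f"node {node}, gpu {gpu} -> rank {global_rank(node, gpus, gpu)}")
```

Under that scheme only one process in the whole job ends up as rank 0, and the second machine would be started with the same --dist-url and --world-size 2 but with --rank 1.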
Did you ever solve this? I am currently using torch 2.4.0 for multi-node training and followed the tutorial above, but it hangs while connecting.
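A hang at connection time is consistent with the rendezvous waiting for processes that never arrive, for example when the other node has not been started or cannot reach the rank 0 address and port (a firewall is one possible cause; this is a guess, not a diagnosis). A small reachability check that could be run from the second machine while rank 0 is already waiting, using only the standard library (can_connect is a hypothetical helper, and the host and port below are placeholders for the values in --dist-url):

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


# Example call (placeholder values, substitute the ones from --dist-url):
# print(can_connect("10.246.246.22", 29500))
```

If this returns False while rank 0 is running, the problem is likely at the network level and not in the training script itself.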