PyTorch multi-node training fails with TCPStore RuntimeError: Address already in use

Problem description

I am training a network on 2 machines, each with two GPUs. I have checked the PORT number for connecting the two machines to each other, but I get the error below every time.

How do I find the port number?

sudo lsof -i :22 | grep LISTEN

sshd    2101    root    3u  IPv4  57356      0t0  TCP *:ssh (LISTEN)
sshd    2101    root    4u  IPv6  57358      0t0  TCP *:ssh (LISTEN)
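Note that lsof -i :22 only lists what is bound to port 22 (ssh), and the 57356 in that output is the kernel socket inode (lsof's DEVICE column), not a port number; the only port shown on those lines is 22. The RuntimeError means the port given in --dist-url is already bound on the rank-0 machine, so the rendezvous needs a port nothing else is using. A minimal sketch of one way to get one, assuming the port is used right away so nothing else grabs it in the meantime:

import socket

# Bind to port 0 and the kernel assigns an unused TCP port.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("", 0))
free_port = s.getsockname()[1]
print(free_port)   # use this number in --dist-url tcp://10.246.246.22:<port>
s.close()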

Script

python imagenet_multi_node.py -a resnet50 --dist-url tcp://10.246.246.22:57356 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 0 -b 128 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/
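With --world-size 2 this only launches the first node. Following the torchvision ImageNet example that this script appears to be based on, the second machine would presumably run the identical command with --rank 1, keeping the same --dist-url so both nodes rendezvous at the rank-0 machine (the dataset path shown here is the rank-0 one and may differ on the other node):

python imagenet_multi_node.py -a resnet50 --dist-url tcp://10.246.246.22:57356 --dist-backend 'nccl' --multiprocessing-distributed --world-size 2 --rank 1 -b 128 /home2/coremax/Documents/ILSVRC/Data/CLS-LOC/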

Traceback:

Use GPU: 1 for training
Use GPU: 0 for training
Traceback (most recent call last):
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 511, in <module>
    main()
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 117, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home2/coremax/Documents/GridMask/imagenet_grid/imagenet_multi_node.py", line 137, in main_worker
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 576, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 183, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/home2/coremax/anaconda3/lib/python3.9/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: Address already in use
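The failing call is the TCPStore server that rank 0 starts at the --dist-url address. Below is a minimal sketch of that rendezvous as a connectivity test; it assumes two processes in total, uses the gloo backend so it runs without GPUs, and uses 29500 as an arbitrary port assumed free on 10.246.246.22 (any free port works):

import torch.distributed as dist

# Rank 0 binds a TCPStore server at this host:port; if the port is
# already bound there, init raises "RuntimeError: Address already in use".
dist.init_process_group(
    backend="gloo",                           # 'nccl' in the real run
    init_method="tcp://10.246.246.22:29500",
    world_size=2,
    rank=0,                                   # rank=1 on the second machine
)
dist.barrier()                 # returns once both ranks have connected
dist.destroy_process_group()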
Tags: pytorch, distributed-computing, training-data, dataparallel
1 Answer

Has your problem been solved? I am currently running multi-node training with torch 2.4.0, following the tutorial above, but it hangs while connecting.
