How do I get PyTorch Lightning to run on multiple GPUs?


I have a model in PyTorch Lightning that I want to train on multiple GPUs to speed things up, and I have been following https://pytorch-lightning.readthedocs.io/en/stable/accelerators/gpu_intermediate.html and https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html#build-your-slurm-script. The code works on a single GPU; below I point out the changes I made for multiple GPUs. My classes currently look like this:

class model(pl.LightningModule):
    # nothing changed here for multi-GPU training
    def __init__(self): ...
    def forward(self, x): ...
    def configure_optimizers(self): ...
    def training_step(self, batch, batch_idx): ...
    def test_step(self, batch, batch_idx): ...
    def validation_step(self, batch, batch_idx): ...
    def predict_step(self, batch, batch_idx): ...

class MyDataset(Dataset):
    def __getitem__(self, index):
        x = torch.load(f'file_{index}.pt')
        y = torch.load(f'file_{index}.pt')
        return x, y
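As far as I understand the docs, under DDP Lightning moves each batch to that process's GPU by itself, so the steps should not need any manual device handling. A minimal sketch of what I mean (simplified, not my real training_step; the layer names are just the ones from the model summary further down):

    def training_step(self, batch, batch_idx):
        x, y = batch                           # already on this rank's GPU, no .to(device) needed
        y_hat = self.softplus(self.linear(x))
        loss = self.loss(y_hat, y)             # PoissonNLLLoss from the model summary
        self.log('train_loss', loss)
        return loss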

The train.py script:

import torch
import torch.multiprocessing as mp
import pytorch_lightning as pl
from torch.utils.data import DataLoader

from MyDataset import MyDataset
from model import model

def main():
    mp.set_start_method('spawn', force=True)

    N_PROCESSES = 2
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    training_set = MyDataset(partition_indices['train'])
    training_generator = DataLoader(training_set,
                                    batch_size=64,
                                    shuffle=True,
                                    num_workers=1,
                                    pin_memory=False)
    val_set = MyDataset(partition_indices['val'])
    val_generator = DataLoader(val_set,
                               batch_size=64,
                               shuffle=False,
                               num_workers=1,
                               pin_memory=False)
    # it works for one GPU with num_workers = 0; with num_workers = 1 or num_workers > 1 it gives an error.

    clf = model()
    trainer = pl.Trainer(max_epochs=EPOCHS,
                         callbacks=callbacks,
                         enable_checkpointing=True,
                         logger=logger,
                         accelerator='gpu',
                         num_nodes=1,
                         devices=2,
                         precision=16,
                         strategy='ddp')
    trainer.fit(clf, training_generator, val_generator)
    # it works for one GPU with devices = 1 and without the strategy argument.
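One thing I am not sure about is the entry point: the traceback further down shows the whole script being re-executed by multiprocessing's spawn machinery (runpy.run_path ends up calling trainer.fit again, in <module> at line 101). The pattern I see in examples is a spawn-safe guard like the sketch below (assuming main() wraps all of the code above; I have not verified this is what fixes my setup):

    if __name__ == '__main__':
        # guard so spawned worker processes can import train.py
        # without re-running the training loop at import time
        main()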

My SLURM submission script contains these options:

#SBATCH --partition=highmemgpu
#SBATCH --gres=gpu:2
#SBATCH --ntasks-per-node=2
#SBATCH --nodes=1
srun python train.py

According to https://pytorch-lightning.readthedocs.io/en/stable/clouds/cluster_advanced.html#build-your-slurm-script: num_nodes should match --nodes (= 1) and devices should match --ntasks-per-node (= 2).
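Restating that mapping next to my own settings (just the correspondence as I understand it from the docs, not a configuration I have verified to work):

    trainer = pl.Trainer(
        accelerator='gpu',
        strategy='ddp',
        num_nodes=1,   # <- #SBATCH --nodes=1
        devices=2,     # <- #SBATCH --ntasks-per-node=2 (and --gres=gpu:2)
    )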

The beginning of the output file:

Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
Using device (set to GPU if available): cuda
Using device (set to GPU if available): cuda

┏━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name     ┃ Type           ┃ Params ┃
┡━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ linear   │ Linear         │ 16.3 M │
│ 1 │ softplus │ Softplus       │      0 │
│ 2 │ loss     │ PoissonNLLLoss │      0 │
└───┴──────────┴────────────────┴────────┘
Trainable params: 16.3 M                                                        
Non-trainable params: 0                                                         
Total params: 16.3 M                                                            
Total estimated model params size (MB): 32                                      
SLURM auto-requeueing enabled. Setting signal handlers.
SLURM auto-requeueing enabled. Setting signal handlers.
/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Using device (set to GPU if available): cuda
12
parameters: 
 n processes: 2 
 batch size: 400 
 max epochs: 1 
 strategy: ddp
Using device (set to GPU if available): cuda
number of train samples: 27216
number of train batches: 69
number of val samples: 6805
number of validation batches: 18
folder where model is stored: ./model_2023-03-09 19:48:10.733840
2023-03-09 19:48:18.607028: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-09 19:48:18.607674: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Using 16bit None Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:16561 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:16561 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/runpy.py", line 289, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/exports/humgen/idenhond/projects/enformer/dnn_head/dnn_head_train/train.py", line 101, in <module>
    trainer.fit(clf, training_generator, val_generator)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1048, in _run
    self.strategy.setup_environment()
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 152, in setup_environment
    self.setup_distributed()
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/lightning_fabric/utilities/distributed.py", line 245, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 754, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/exports/humgen/idenhond/miniconda3/envs/enformer_dev/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 177, in _create_c10d_store
    return TCPStore(
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:16561 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:16561 (errno: 98 - Address already in use).

I have tried many different things and combinations, but I think I do not quite understand how GPUs work together with nodes and tasks. I have not found a good explanation, so I hope someone can point me in the right direction.

I expect this script to run with >1 GPU; they should be available from our resources.

I am also not clear on how the num_workers argument works. I tried increasing it to >1 because some places recommend this, but that did not work.
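My current understanding, which may be wrong: num_workers only controls how many CPU processes each DataLoader uses to load batches, independently of the number of GPUs, so each DDP rank would get its own set of loader workers. Something like the sketch below is what I mean (hypothetical values, not my tested settings):

    training_generator = DataLoader(training_set,
                                    batch_size=64,
                                    shuffle=True,
                                    num_workers=4,            # 4 CPU loader processes per rank, not per GPU
                                    persistent_workers=True,  # keep workers alive between epochs
                                    pin_memory=True)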

parallel-processing pytorch gpu slurm pytorch-lightning