RuntimeError: CUDA error: device-side assert triggered


I have been trying to reproduce the results of this repo: https://github.com/sefcom/VarBERT/tree/main

I was able to train the BERT model for the MLM (masked language modeling) objective, but during constrained masked language model (CMLM) training I keep running into this error:

```
/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/transformers/optimization.py:429: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
05/14/2024 06:58:30 - INFO - __main__ -   ***** Running training *****
05/14/2024 06:58:30 - INFO - __main__ -     Num examples = 4509495
05/14/2024 06:58:30 - INFO - __main__ -     Num Epochs = 30
05/14/2024 06:58:30 - INFO - __main__ -     Instantaneous batch size per GPU = 32
05/14/2024 06:58:30 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 32
05/14/2024 06:58:30 - INFO - __main__ -     Gradient Accumulation steps = 1
05/14/2024 06:58:30 - INFO - __main__ -     Total optimization steps = 4227660
05/14/2024 06:58:30 - INFO - __main__ -     Starting fine-tuning.
Epoch:   0%|          | 0/30 [00:00<?, ?it/s]
/opt/conda/conda-bld/pytorch_1711403408687/work/aten/src/ATen/native/cuda/Loss.cu:250: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [26,0,0] Assertion `t >= 0 && t < n_classes` failed.
Iteration:   0%|          | 0/140922 [00:01<?, ?it/s]
Epoch:   0%|          | 0/30 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/hprakash/VarBERT/varbert/cmlm/training.py", line 944, in <module>
    main()
  File "/home/hprakash/VarBERT/varbert/cmlm/training.py", line 892, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/VarBERT/varbert/cmlm/training.py", line 488, in train
    outputs = model(inputs,labels=labels)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/VarBERT/varbert/cmlm/training.py", line 139, in forward
    masked_lm_loss = loss_fct(prediction_scores.view(-1, vocab_size), labels.view(-1))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/modules/loss.py", line 1179, in forward
    return F.cross_entropy(input, target, weight=self.weight,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hprakash/.conda/envs/HP/lib/python3.11/site-packages/torch/nn/functional.py", line 3059, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

Running the training script on the CPU instead, I get this error:

```
IndexError: Target 50001 is out of bounds.
```

I found some articles online suggesting that this is caused by a mismatch between the vocabulary size the model expects and the size defined in the tokenizer configuration. I made the corresponding changes, but the error persists. I need to resolve this to train the model further.
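For reference, a mismatch like this can be checked before training starts. The sketch below assumes a Hugging Face-style checkpoint; the paths are placeholders, not the repo's actual ones:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder paths; substitute whatever checkpoint and tokenizer the
# training script actually loads.
tokenizer = AutoTokenizer.from_pretrained("path/to/tokenizer")
model = AutoModelForMaskedLM.from_pretrained("path/to/model")

print("tokenizer vocab size:", len(tokenizer))  # base vocab + added tokens
print("model config vocab_size:", model.config.vocab_size)

# A label id of 50001 is only valid if the model scores at least 50002
# classes, since valid class indices run from 0 to vocab_size - 1.
```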

machine-learning pytorch tokenize bert-language-model
1 Answer

I can't tell you exactly where things went wrong, but I can tell you it is a vocabulary-size problem.

Your loss targets contain the value `50001`. Each target indexes into the model's output tensor, which has shape `(bs, C, ...)`, where `C` is the number of items in the vocabulary. For your model, `C <= 50001` (valid class indices run from `0` to `C - 1`), so the target `50001` is out of bounds and you get an index error.
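The same failure is easy to reproduce in isolation (a toy example, unrelated to the repo's code):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 50001)            # C = 50001 classes, valid targets 0..50000
targets = torch.tensor([0, 1, 2, 50001])  # last target is out of range by one

# Raises "IndexError: Target 50001 is out of bounds." on CPU; on CUDA the
# same condition surfaces as "device-side assert triggered".
F.cross_entropy(logits, targets)
```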

You need to set up the model so that the size of its output matches the size of your vocabulary. Check what size the model's output actually is, then update the model configuration to use the correct size matching your vocabulary.
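With Hugging Face Transformers, one common way to do this is to resize the embedding matrices to the tokenizer's vocabulary. This is only a sketch, assuming the mismatch comes from tokens added on top of the base vocabulary and reusing the `model` and `tokenizer` objects from the check above; if the labels were produced with a different tokenizer than the model's, the real fix is to regenerate them with the matching one:

```python
# Grow (or shrink) the input and output embeddings so the prediction head
# emits one score per tokenizer entry.
new_vocab_size = len(tokenizer)
if model.config.vocab_size != new_vocab_size:
    model.resize_token_embeddings(new_vocab_size)

# Sanity check: the output layer must now cover every label id, incl. 50001.
assert model.get_output_embeddings().weight.shape[0] == new_vocab_size
```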
