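An error occurs when I move a tensor to "cuda". The same thing happens when I move a tensor from "cuda" to the CPU.
I have checked the shapes and dtypes of my tensors, and everything looks fine. Does anyone know what the problem might be?
My traceback: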
/opt/conda/conda-bld/pytorch_1678402411778/work/aten/src/ATen/native/cuda/Indexing.cu:1146:
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
......
Traceback (most recent call last):
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 989, in _run
results = self._run_stage()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
self._run_sanity_check()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1062, in _run_sanity_check
val_loop.run()
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 134, in run
self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 391, in _evaluation_step
output = call._call_strategy_hook(trainer, hook_name, *step_args)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/xc/.conda/envs/molbart_new/lib/python3.10/site-packages/lightning/pytorch/strategies/strategy.py", line 403, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1027, in validation_step
bs, logits, loss = self.forward(batch)
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 934, in forward
indices, templates_candidates, templates_candidates_score = self.topk_candidates(
File "/home/xc/xc_mol_seq/xc_work/seq_template/modules/model.py", line 1138, in topk_candidates
scores, indices = scores.cpu().detach().numpy(), indices.cpu().detach().numpy()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
At the beginning of the trace you pasted, I see:
indexSelectLargeIndex: block: [55,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
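This is a shape error. You are indexing a tensor with an index that is greater than or equal to the tensor's size along the given dimension, similar to an "IndexOutOfBoundsException". It has nothing to do with moving tensors between cuda and cpu.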
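To make the out-of-bounds condition concrete, here is a minimal sketch that checks indices against the size of the selected dimension before the indexing op runs. The helper `find_out_of_range` and the tensors `src` and `idx` are made up for illustration; they are not taken from your `model.py`. The check has to run before the failing op, because once the device-side assert has fired, later CUDA calls (including `.cpu()`) fail as well.

```python
import torch


def find_out_of_range(index: torch.Tensor, dim_size: int) -> torch.Tensor:
    """Return the index values that fall outside [0, dim_size)."""
    index = index.detach().cpu()
    bad = (index < 0) | (index >= dim_size)
    return index[bad]


# Hypothetical data: src has size 4 along dim 0, so valid indices are 0..3.
src = torch.arange(12.0).reshape(4, 3)
idx = torch.tensor([0, 3, 4])  # 4 is out of range for dim 0

bad = find_out_of_range(idx, src.size(0))
if bad.numel() > 0:
    print(f"out-of-range indices for a dim of size {src.size(0)}: {bad.tolist()}")
else:
    picked = torch.index_select(src, 0, idx)
```

A check like this is typically useful right before an `index_select`, an embedding lookup, or a gather, for example to compare token ids against the vocabulary size.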
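The stack trace points you to a different part of the code, but that is because, as the error message itself says:
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.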
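Since the error message suggests `CUDA_LAUNCH_BLOCKING=1`, one way to apply it is a small sketch like the following at the very top of the entry script. Where that script lives and what it is called is an assumption on my part. With blocking launches the stack trace should point at the op that actually failed instead of a later `.cpu()` call.

```python
import os

# Set this before the CUDA context is created. The safest place is the top of
# the entry script, before torch is imported anywhere.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and the rest of the training code) after this point

print("CUDA_LAUNCH_BLOCKING =", os.environ["CUDA_LAUNCH_BLOCKING"])
```

Another option is to run the same batch on the CPU, where an out-of-range index raises a regular Python exception at the offending line.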