Using a TorchScript model and running into: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu

Problem description (0 votes, 1 answer)

I trained an ALBEF-based model in Python and, for overall efficiency, decided to run inference in C++. I used torch.jit.trace in Python to save the model and loaded the resulting .pt file in C++. However, I hit the error in the title during model inference.

First, my C++ code:

// Load the traced model directly onto the GPU if one is available.
if (torch::cuda::is_available()) {
    n_model = torch::jit::load("/home/lzh/Storage4/lzh/deepmodel/model_scripted.pt", torch::kCUDA);
    std::cout << torch::cuda::device_count() << std::endl;
} else {
    std::cerr << "No CUDA devices available, cannot move model to GPU." << std::endl;
}
// Wrap the raw buffer and move the image tensor to the GPU.
torch::Tensor inputs = torch::from_blob(fre, {1, 4, 300, 201}, torch::kFloat).to(torch::kCUDA);
std::cout << inputs.device() << std::endl;
// Tensor::to() is not in-place; assign the result back, otherwise the text inputs stay on the CPU.
textInput.input_ids = textInput.input_ids.to(torch::kCUDA);
textInput.attention_mask = textInput.attention_mask.to(torch::kCUDA);
torch::Tensor out_tensor = n_model.forward({inputs, textInput.input_ids, textInput.attention_mask}).toTensor();

This is where the problem appears:

The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__/models/model_somatic.py", line 14, in forward
    cls_head = self.cls_head
    ALBEF = self.ALBEF
    _0 = (ALBEF).forward(image, input_ids, attention_mask, )
          ~~~~~~~~~~~~~~ <--- HERE
    return (cls_head).forward(_0, )
class ALBEF(Module):
  File "code/__torch__/models/model_somatic.py", line 35, in forward
    _5 = torch.ones([_3, int(_4)], dtype=4, layout=None, device=torch.device("cpu"), pin_memory=False)
    encoder_attention_mask = torch.to(_5, dtype=4, layout=0, device=torch.device("cpu"))
    _6 = (text_encoder).forward(input_ids, attention_mask, _1, encoder_attention_mask, )
          ~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    _7 = torch.slice(_6, 0, 0, 9223372036854775807)
    input = torch.slice(torch.select(_7, 1, 0), 1, 0, 9223372036854775807)
  File "code/__torch__/models/xbert.py", line 19, in forward
    cls = self.cls
    bert0 = self.bert
    _0 = (bert0).forward(input_ids, attention_mask, argument_3, encoder_attention_mask, )
          ~~~~~~~~~~~~~~ <--- HERE
    _1 = (cls).forward(weight, _0, )
    return _0
  File "code/__torch__/models/xbert.py", line 50, in forward
    _8 = torch.to(encoder_extended_attention_mask, 6)
    attention_mask1 = torch.mul(torch.rsub(_8, 1.), CONSTANTS.c3)
    _9 = (embeddings).forward(input_ids, input, )
          ~~~~~~~~~~~~~~~~~~~ <--- HERE
    _10 = (encoder).forward(_9, attention_mask0, argument_3, attention_mask1, )
    return _10
  File "code/__torch__/models/xbert.py", line 78, in forward
    input0 = torch.slice(_12, 1, 0, _11)
    _13 = (word_embeddings).forward(input_ids, )
    _14 = (token_type_embeddings).forward(input, )
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    embeddings = torch.add(_13, _14)
    _15 = (position_embeddings).forward(input0, )
  File "code/__torch__/torch/nn/modules/sparse/___torch_mangle_164.py", line 10, in forward
    input: Tensor) -> Tensor:
    weight = self.weight
    return torch.embedding(weight, input)
           ~~~~~~~~~~~~~~~ <--- HERE

Traceback of TorchScript, original code (most recent call last):
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/functional.py(2044): embedding
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/sparse.py(158): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(207): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(1046): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/xbert.py(1400): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/model_somatic.py(47): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/ALBEF/models/model_somatic.py(90): forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1090): _slow_forward
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/nn/modules/module.py(1102): _call_impl
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/jit/_trace.py(958): trace_module
/home/lzh/miniconda3/envs/albef/lib/python3.8/site-packages/torch/jit/_trace.py(741): trace
/home/lzh/ALBEF/checkpoint.py(46): main
/home/lzh/ALBEF/checkpoint.py(76): <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
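The bottom of the trace shows the failing op: torch.embedding receives an embedding weight on cuda:0 but an index tensor that is still on the CPU. For reference, the same error can be reproduced in isolation (a minimal sketch with made-up shapes, unrelated to the actual model):

import torch
import torch.nn.functional as F

weight = torch.randn(30522, 768, device="cuda")  # embedding table on the GPU
ids = torch.tensor([[101, 2023, 102]])           # index tensor left on the CPU
F.embedding(ids, weight)                         # RuntimeError: Expected all tensors to be on the same device ...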

Strangely, I also get this problem when I load the saved file in Python.

import torch

# Trace the model on the CPU with dummy inputs, then save it.
image = torch.rand(16, 4, 300, 201)
text1 = torch.rand(16, 25).long()
text2 = torch.rand(16, 25).long()

traced_script_module = torch.jit.trace(model, (image, text1, text2))
traced_script_module.save('model_scripted.pt')

# Reload the traced model onto the GPU and run it with CUDA inputs.
device = torch.device("cuda:0")
text = torch.ones((1, 25)).long().to(device)
image = torch.ones((1, 4, 300, 201)).to(device)
model = torch.jit.load('model_scripted.pt', map_location=torch.device('cuda'))
model.eval()
for param in model.parameters():
    if param.device.type == 'cuda':
        print('cuda')
print(image.device)
print(text.device)
out = model(image, text, text)
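The serialized code in the trace above already hints at the cause: torch.ones(..., device=torch.device("cpu")) is frozen into the graph, because torch.jit.trace records the concrete devices that were used at trace time. A minimal way to check the traced module for such hard-coded devices (assuming the traced_script_module from the snippet above):

# Print the TorchScript recorded by tracing; search the output for hard-coded
# devices such as device=torch.device("cpu") that were frozen in at trace time.
print(traced_script_module.code)
# The lower-level graph can be inspected the same way.
print(traced_script_module.graph)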

The loop prints cuda for every parameter, and the tensors print cuda:0. The error output is the same as in C++. I also tried loading the model onto the GPU using the method mentioned in the linked post, but it still does not work. What should I do? This problem has been bothering me for a long time.

python c++ pytorch jit libtorch
1 Answer

I solved the problem by first checking that the model code does not hard-code a device when creating tensors, and then moving the model and the example inputs to CUDA before tracing and saving, so that CUDA devices are recorded in the trace:

device = torch.device("cuda:0")
# Move both the model and the example inputs to CUDA before tracing,
# so the devices recorded in the trace are CUDA rather than CPU.
model.to(device)
image = torch.rand(1, 4, 300, 201).to(device)
text1 = torch.rand(1, 25).long().to(device)
text2 = torch.rand(1, 25).long().to(device)
traced_script_module = torch.jit.trace(model, (image, text1, text2))
traced_script_module.save('model_scripted.pt')
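
For completeness, a quick way to verify the re-exported model (a minimal sketch assuming the same input shapes as in the question and that the model returns a single tensor, mirroring the C++ loading code):

import torch

# Reload the CUDA-traced model and run one forward pass with CUDA inputs.
model = torch.jit.load('model_scripted.pt', map_location='cuda')
model.eval()

image = torch.ones(1, 4, 300, 201, device='cuda')
text = torch.ones(1, 25, dtype=torch.long, device='cuda')

with torch.no_grad():
    out = model(image, text, text)
print(out.device)  # expected: cuda:0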