Processing multiple sentences with the MBart50TokenizerFast fast tokenizer


I'm trying to use MBart50TokenizerFast with facebook/mbart-large-50-many-to-one-mmt on a GPU, passing multiple sentences in at once (the sentences can't be combined into one). Here's my code (based on https://stackoverflow.com/a/62688252/194742):

tokenizer.src_lang = source_lang
inputs = tokenizer([title, ftext], return_tensors="pt").to(device)
outputs = model.generate(**inputs).to(device)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
translated_title = translations[0]
translated_ftext = translations[1]

This mostly follows the example given on that page, except that I'm trying to include multiple sentences in a single call. This is the error message I get:

Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

The code does work when tokenizing a single sentence:

inputs = tokenizer(title, return_tensors="pt").to(device)

What is the correct way to pass multiple sentences? Thanks for any pointers.

nlp huggingface-transformers huggingface-tokenizers machine-translation
1 Answer

It looks like I had to enable padding and truncation, as the error message suggested. The code that finally worked:

tokenizer.src_lang = source_lang
inputs = tokenizer([title, ftext], return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
outputs = model.generate(**inputs, max_length=512)
translations = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
translated_title = translations[0]
translated_ftext = translations[1]
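The underlying issue is that the two sentences tokenize to different numbers of token IDs, so the resulting lists are ragged and cannot be stacked into a single `pt` tensor; `padding=True` pads every sequence to the batch's longest length and adds an attention mask so the model ignores the pad positions. A minimal sketch of what that padding step does, in plain Python with made-up token IDs (pad ID 1 matches MBart's pad token, but the sequences here are purely illustrative):

```python
def pad_batch(sequences, pad_id=1):
    """Right-pad ragged token-ID lists to the longest length and
    build an attention mask (1 = real token, 0 = padding)."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

# Two "sentences" of different token lengths, like title vs. ftext:
ids, mask = pad_batch([[250004, 8, 9, 2], [250004, 8, 9, 10, 11, 2]])
# ids  -> [[250004, 8, 9, 2, 1, 1], [250004, 8, 9, 10, 11, 2]]
# mask -> [[1, 1, 1, 1, 0, 0],      [1, 1, 1, 1, 1, 1]]
```

`truncation=True` with `max_length` handles the opposite problem, clipping sequences that would exceed the model's length limit.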