Is there a way to keep emoji from being dropped during translation when using the huggingface--Helsinki-NLP--opus-mt-ROMANCE-en model? Expected behavior:
"Bonjour ma France 🇫🇷"@fr --> "Hello my France 🇫🇷"@en
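I can see that the pretrained default tokenizer has the emoji in its vocabulary, yet the emoji is lost somewhere before decoding. Example code: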
import transformers
# setup
engine = 'pt'
resource = 'huggingface--Helsinki-NLP--opus-mt-ROMANCE-en'
nlp = transformers.pipeline(
    task="translation",
    model=transformers.MarianMTModel.from_pretrained(resource),
    tokenizer=transformers.AutoTokenizer.from_pretrained(resource),
    framework=engine
)
# infer
translated = nlp.tokenizer.batch_decode(
    skip_special_tokens=True,
    sequences=nlp.model.generate(
        **nlp.tokenizer(
            text=["Bonjour ma France 🇫🇷"],
            return_tensors=engine
        )
    )
)
# results
print(translated) # ['Hello, my France.']
print("🇫🇷" in nlp.tokenizer.get_vocab()) # True
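One workaround I have been considering, in case the model itself cannot be made to keep the emoji: split the text into emoji runs and plain-text runs, translate only the plain-text runs, and stitch everything back together in the original order. The sketch below is only that idea, not a fix for the model. The helper name translate_keeping_emoji, the translate callable, and the EMOJI_RUN pattern are all hypothetical names of mine, and the pattern is a rough approximation of emoji code points rather than a complete definition. Translating fragments separately may also cost some context when an emoji sits in the middle of a sentence.

```python
import re
from typing import Callable

# Rough emoji matcher (an assumption, not a full Unicode emoji definition):
# regional-indicator flags, common pictograph blocks, misc symbols/dingbats,
# plus the zero-width joiner and variation selector used inside sequences.
EMOJI_RUN = re.compile(
    "(["
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001FAFF"
    "\u2600-\u27BF"
    "\u200D\uFE0F"
    "]+)"
)


def translate_keeping_emoji(text: str, translate: Callable[[str], str]) -> str:
    """Translate the non-emoji runs of `text`, leaving emoji runs untouched."""
    # With one capturing group, re.split puts the emoji runs at odd indices.
    parts = EMOJI_RUN.split(text)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1 or not part.strip():
            # Emoji run, or an empty / whitespace-only gap: copy as-is.
            out.append(part)
            continue
        # Keep surrounding whitespace so spacing around the emoji survives.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + translate(part.strip()) + trail)
    return "".join(out)
```

With the pipeline from the code above, the translate argument could presumably be something like lambda s: nlp(s)[0]["translation_text"], though I have not verified how the model handles the shorter fragments.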