NLP translation that preserves emoji


Is there a way to keep the translation from dropping emoji when using the huggingface--Helsinki-NLP--opus-mt-ROMANCE-en model? Expected behavior:

"Bonjour ma France 🇫🇷"@fr --> "Hello my France 🇫🇷"@en

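I can see that the pretrained default tokenizer has the emoji in its vocabulary, but the emoji is lost somewhere before decoding. Example code: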

import transformers

# setup
engine = 'pt'
resource = 'huggingface--Helsinki-NLP--opus-mt-ROMANCE-en'
nlp = transformers.pipeline(
   task="translation",
   model=transformers.MarianMTModel.from_pretrained(resource),
   tokenizer=transformers.AutoTokenizer.from_pretrained(resource),
   framework=engine
)

# infer
translated = nlp.tokenizer.batch_decode(
   skip_special_tokens=True,
   sequences=nlp.model.generate(
      **nlp.tokenizer(
         text=["Bonjour ma France 🇫🇷"],
         return_tensors=engine
      )
   )
)

# results
print(translated) # ['Hello, my France.']
print("🇫🇷" in nlp.tokenizer.get_vocab()) # True
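
One possible workaround, sketched here under stated assumptions, is to keep the emoji away from the model entirely: split the text at emoji runs, translate only the text segments, and splice the emoji back in. This does not explain why the model drops the emoji. The helper name `translate_keeping_emoji`, the `translate` callback, and the `EMOJI_RE` pattern are all hypothetical, and the pattern is a rough approximation rather than the full Unicode emoji definition.

```python
import re
from typing import Callable, List

# Rough emoji pattern (an assumption, not complete Unicode emoji coverage):
# pairs of regional indicators (flags), or a character from some common
# pictograph blocks optionally followed by variation selectors / ZWJ.
EMOJI_RE = re.compile(
    "(?:[\U0001F1E6-\U0001F1FF]{2}"
    "|[\U0001F300-\U0001FAFF\u2600-\u27BF][\uFE0F\u200D]*)+"
)


def translate_keeping_emoji(text: str, translate: Callable[[str], str]) -> str:
    """Translate the text between emoji runs and put the emoji back in order.

    `translate` is any function mapping a source string to its translation.
    Segments are joined with single spaces, so original spacing is normalized.
    """
    pieces: List[str] = []
    pos = 0
    for match in EMOJI_RE.finditer(text):
        segment = text[pos:match.start()].strip()
        if segment:
            pieces.append(translate(segment))
        pieces.append(match.group())  # emoji run is copied through untouched
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        pieces.append(translate(tail))
    return " ".join(pieces)
```

With the pipeline from the question, the callback could be something like `lambda s: nlp(s)[0]["translation_text"]`. Note that an emoji in the middle of a sentence splits it into separately translated segments, which may hurt translation quality, so this fits best when emoji appear at the start or end of the text.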
python translation emoji huggingface