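I tried using transformers for NER in French with the CamemBERT model. I took this code from https://huggingface.co/transformers/usage.html. Unfortunately, the predictions for my short sentence are not satisfactory, and I can't tell whether something is wrong with my code.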
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("camembert-base")
tokenizer = AutoTokenizer.from_pretrained("camembert-base")
label_list = [
"O", # Outside of a named entity
"B-MISC", # Beginning of a miscellaneous entity right after another miscellaneous entity
"I-MISC", # Miscellaneous entity
"B-PER", # Beginning of a person's name right after another person's name
"I-PER", # Person's name
"B-ORG", # Beginning of an organisation right after another organisation
"I-ORG", # Organisation
"B-LOC", # Beginning of a location right after another location
"I-LOC" # Location
]
sequence = "Paris, capitale de la France, est une grande ville européenne et un centre mondial de l'art, de la mode, de la gastronomie et de la culture."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
Predicted output:
[('<s>', 'O'), ('▁Paris', 'O'), (',', 'O'), ('▁capitale', 'B-MISC'), ('▁de', 'O'), ('▁la', 'B-MISC'), ('▁France', 'O'), (',', 'O'), ('▁est', 'B-MISC'), ('▁une', 'O'), ('▁grande', 'O'), ('▁ville', 'O'), ('▁européenne', 'O'), ('▁et', 'O'), ('▁un', 'O'), ('▁centre', 'O'), ('▁mondial', 'O'), ('▁de', 'O'), ('▁l', 'O'), ("'", 'O'), ('art', 'O'), (',', 'O'), ('▁de', 'O'), ('▁la', 'B-MISC'), ('▁mode', 'B-MISC'), (',', 'O'), ('▁de', 'O'), ('▁la', 'B-MISC'), ('▁gastronomie', 'O'), ('▁et', 'O'), ('▁de',
'O'), ('▁la', 'O'), ('▁culture', 'O'), ('.', 'O'), ('</s>', 'O')]
You should search for a French model at https://huggingface.co/models?search=conll03.
You could open an issue simply to check whether the model has been fine-tuned for the NER task.
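That check can also be approximated by looking at a checkpoint's configuration. Here is a minimal sketch, assuming the checkpoint's config.json has already been loaded into a plain dict, and assuming that a checkpoint fine-tuned for token classification lists a class ending in ForTokenClassification under "architectures" and carries task-specific names in "id2label". The helper looks_fine_tuned_for_ner and both example dicts are hypothetical, abbreviated illustrations, not real files:

```python
def looks_fine_tuned_for_ner(config: dict, min_labels: int = 3) -> bool:
    """Heuristic: does this config describe a token-classification checkpoint?"""
    architectures = config.get("architectures") or []
    has_token_head = any(
        name.endswith("ForTokenClassification") for name in architectures
    )
    id2label = config.get("id2label") or {}
    # Placeholder label names look like LABEL_0, LABEL_1, ...
    has_real_labels = any(
        not str(label).startswith("LABEL_") for label in id2label.values()
    )
    return has_token_head and has_real_labels and len(id2label) >= min_labels


# Hypothetical, abbreviated configs (assumed shapes, for illustration only)
base_config = {"architectures": ["CamembertForMaskedLM"]}
ner_config = {
    "architectures": ["CamembertForTokenClassification"],
    "id2label": {"0": "O", "1": "I-LOC", "2": "I-PER", "3": "I-MISC", "4": "I-ORG"},
}

print(looks_fine_tuned_for_ner(base_config))
print(looks_fine_tuned_for_ner(ner_config))
```

This is only a heuristic: a config can pass it and still give poor predictions, so it is worth confirming on a few sentences as well.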
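The final classification layer of your model should end with
Linear(in_features=768, out_features=9, bias=True)
so that there is one output per entry in your label_list.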
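With the model "camembert-base", you only have 2 output features. It has not been fine-tuned for NER, so its token-classification head is newly initialized with the default of 2 labels. That is why only "O" and "B-MISC" (indices 0 and 1) appear in your output.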
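To catch this kind of mismatch early, the decoding step can refuse to run when the number of output features differs from the number of labels. Here is a minimal sketch in plain Python, assuming the logits for one sentence have been converted to nested lists (for example with outputs[0].tolist()). decode_predictions is a hypothetical helper, not part of transformers, and the scores in the example are made up:

```python
label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER",
              "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def decode_predictions(tokens, logits, labels):
    """Pair each token with its highest-scoring label.

    tokens: list of str; logits: one list of scores per token.
    Raises ValueError if the shapes do not match the label set.
    """
    if len(tokens) != len(logits):
        raise ValueError(
            f"{len(tokens)} tokens but {len(logits)} rows of scores"
        )
    decoded = []
    for token, scores in zip(tokens, logits):
        if len(scores) != len(labels):
            raise ValueError(
                f"model emits {len(scores)} scores per token but there are "
                f"{len(labels)} labels; the classification head was probably "
                "not fine-tuned for this label set"
            )
        # Index of the largest score (argmax over the label dimension)
        best = max(range(len(scores)), key=scores.__getitem__)
        decoded.append((token, labels[best]))
    return decoded


# Made-up scores from a 2-output head, as with "camembert-base"
try:
    decode_predictions(["▁Paris", ","], [[0.3, 0.1], [0.2, 0.4]], label_list)
except ValueError as err:
    print(err)
```

With a head that really has 9 outputs, the same call returns (token, label) pairs instead of raising.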