Transformers NER pipeline returns partial words with ##s

Question · 0 votes · 2 answers

How should I interpret the partial words with "##"s that the Transformers NER pipeline returns? Other tools such as Flair and spaCy return the word together with its tag. I have worked with the CoNLL dataset before and never noticed anything like this. Also, why are words split up this way?

Example from HuggingFace:

from transformers import pipeline

nlp = pipeline("ner")

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very " \
           "close to the Manhattan Bridge which is visible from the window."

print(nlp(sequence))

Output:

[
    {'word': 'Hu', 'score': 0.9995632767677307, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9915938973426819, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9982671737670898, 'entity': 'I-ORG'},
    {'word': 'Inc', 'score': 0.9994403719902039, 'entity': 'I-ORG'},
    {'word': 'New', 'score': 0.9994346499443054, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9993270635604858, 'entity': 'I-LOC'},
    {'word': 'City', 'score': 0.9993864893913269, 'entity': 'I-LOC'},
    {'word': 'D', 'score': 0.9825621843338013, 'entity': 'I-LOC'},
    {'word': '##UM', 'score': 0.936983048915863, 'entity': 'I-LOC'},
    {'word': '##BO', 'score': 0.8987102508544922, 'entity': 'I-LOC'},
    {'word': 'Manhattan', 'score': 0.9758241176605225, 'entity': 'I-LOC'},
    {'word': 'Bridge', 'score': 0.990249514579773, 'entity': 'I-LOC'}
]
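One direct way to deal with the "##" pieces is to merge them back into whole words after the fact. A minimal sketch, assuming the pipeline output keeps tokens in input order (merge_wordpieces is a hypothetical helper written for this answer, not part of the transformers API):

```python
def merge_wordpieces(tokens):
    """Merge '##'-prefixed sub-word tokens back into whole words.

    Keeps the entity label of the first piece and the minimum score of
    all pieces as a conservative confidence estimate.
    """
    merged = []
    for tok in tokens:
        if tok["word"].startswith("##") and merged:
            merged[-1]["word"] += tok["word"][2:]  # strip the '##' marker
            merged[-1]["score"] = min(merged[-1]["score"], tok["score"])
        else:
            merged.append(dict(tok))  # copy so the input list is untouched
    return merged

pieces = [
    {"word": "D",    "score": 0.9825, "entity": "I-LOC"},
    {"word": "##UM", "score": 0.9369, "entity": "I-LOC"},
    {"word": "##BO", "score": 0.8987, "entity": "I-LOC"},
]
print(merge_wordpieces(pieces))
# [{'word': 'DUMBO', 'score': 0.8987, 'entity': 'I-LOC'}]
```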
python pytorch named-entity-recognition huggingface-transformers
2 Answers

3 votes

PyTorch Transformers models such as BERT use a WordPiece tokenizer, which produces two kinds of tokens: whole words that appear in the vocabulary, and sub-word pieces for words that do not. A word outside the vocabulary is split into a first piece plus one or more continuation pieces, and each continuation piece is prefixed with "##".

Suppose you have the following phrase:

I like hugging animals

The first set of tokens, whole words, is:

["I", "like", "hugging", "animals"]

The second list, containing sub-words, is:

["I", "like", "hug", "##ging", "animal", "##s"]

You can learn more here: https://www.kaggle.com/funtowiczmo/hugging-face-tutorials-training-tokenizer
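The split itself is a greedy, longest-match-first lookup against the tokenizer's vocabulary. A minimal sketch with a toy vocabulary (an assumption for illustration only; the real BERT vocabulary has roughly 30,000 entries):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first sub-word split, WordPiece style."""
    pieces, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the '##' prefix
            if sub in vocab:
                cur = sub
                break
            end -= 1  # shrink the candidate until it is in the vocabulary
        if cur is None:
            return ["[UNK]"]  # no piece matched at all
        pieces.append(cur)
        start = end
    return pieces

vocab = {"i", "like", "hug", "##ging", "animal", "##s"}
print(wordpiece("hugging", vocab))  # ['hug', '##ging']
print(wordpiece("animals", vocab))  # ['animal', '##s']
```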


0 votes

Use aggregation_strategy to group the entities:

pipeline('ner', model="YOUR_MODEL", aggregation_strategy="average")

Read more about the available strategies in the transformers pipelines documentation.
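Conceptually, aggregation_strategy="average" merges the pieces of each entity into one span and averages their scores. A rough, simplified sketch of that grouping (the real implementation also uses character offsets and B-/I- prefixes to decide span boundaries, so this is an illustration, not the library's code):

```python
from statistics import mean

def group_entities(tokens):
    """Group consecutive tokens with the same entity label into one span,
    averaging the piece scores. Simplified sketch of aggregation."""
    groups = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # drop the I-/B- prefix
        word = tok["word"]
        if groups and groups[-1]["entity_group"] == label:
            if word.startswith("##"):
                groups[-1]["word"] += word[2:]  # glue sub-word onto the previous piece
            else:
                groups[-1]["word"] += " " + word  # new word in the same entity span
            groups[-1]["scores"].append(tok["score"])
        else:
            groups.append({"entity_group": label, "word": word, "scores": [tok["score"]]})
    return [{"entity_group": g["entity_group"], "word": g["word"],
             "score": mean(g["scores"])} for g in groups]

tokens = [
    {"word": "Hu",      "score": 0.9995, "entity": "I-ORG"},
    {"word": "##gging", "score": 0.9915, "entity": "I-ORG"},
    {"word": "Face",    "score": 0.9982, "entity": "I-ORG"},
]
print(group_entities(tokens))
```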
