ent.sent.text in spaCy returns the label instead of the sentence (NER problem)

Question

I am trying to solve a named entity recognition (NER) problem with spaCy on PDF files. I want to extract the modal verbs (will, shall, should, must, etc.) from the PDF files.

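I trained a model on my data in spaCy. When predicting with the trained model, the ent.sent.text attribute of an entity normally returns the sentence from which the label was extracted. In my case, however, it returns the labelled text itself rather than the sentence. Can anyone please help me?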

The code is as follows:

Data preparation code

import csv

import spacy
from spacy.training import offsets_to_biluo_tags


def load_training_data_from_csv(file_path):
    # The loaded pipeline is only used for tokenization (make_doc) below
    nlp = spacy.load('en_core_web_md')
    train_data = []
    with open(file_path, 'r', encoding='cp1252', newline='') as f:
        reader = csv.DictReader(f)
        for row in reader:
            sentence = row['text']
            start, end = int(row['start']), int(row['end'])
            label = row['label']
            train_data.append((sentence, {"entities": [(start, end, label)]}))
            # Check the alignment: a '-' tag marks a span that does not
            # line up with token boundaries
            doc = nlp.make_doc(sentence)
            tags = offsets_to_biluo_tags(doc, [(start, end, label)])
            if '-' in tags:
                print(f"Warning: Misaligned entities in '{sentence}' with entities {[(start, end, label)]}")
    return train_data

Training the model

import random

import spacy
from spacy.training import Example


def train_spacy_ner(train_data, n_iter=10):
    # Load the existing model
    nlp = spacy.load('en_core_web_md')

    # Add the NER pipeline if it doesn't exist (spaCy v3 takes the factory name)
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Add the new modal-verb labels to the NER model
    ner.add_label("WILL")
    ner.add_label("SHALL")
    ner.add_label("MUST")

    # Train the NER model
    # Note: in spaCy v3, begin_training() is a deprecated alias of nlp.initialize()
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        # Log every second epoch and the final one
        log_epoch = i % 2 == 0 or i == n_iter - 1
        if log_epoch:
            print("Epoch - ", i)
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        if log_epoch:
            print("loss : ", losses)

    return nlp

Calling the functions

# nlp = spacy.load("en_core_web_md")
file_path = "/content/trainData.csv"
TRAIN_DATA = load_training_data_from_csv(file_path)

# Train the model
nlp = train_spacy_ner(TRAIN_DATA)
nlp.to_disk('custom_NER')

Predicting with the model (this is where the problem starts)

import spacy

nlp = spacy.load('custom_NER')
text = "The language will be in english"

doc = nlp(text)
# print(doc.ents)
for ent in doc.ents:
  print(ent.sent.text, ent.start_char, ent.end_char, ent.label_)

ent.sent.text should return the sentence used above, but here the labelled text itself is returned.

Output obtained

will 13 17 WILL

Expected output

The language will be in english 13 17 WILL
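
ent.sent is derived from the sentence boundaries stored on the Doc, so it can help to print doc.sents next to the entities. The sketch below does not use the trained custom_NER model. As an assumption for illustration, it builds a blank English pipeline with a rule-based sentencizer and an entity_ruler pattern standing in for the WILL label, then marks every token as a sentence start to show what ent.sent.text looks like when the boundaries are wrong. Whether this is what happens inside the trained model is not established by the post.

```python
import spacy

# Stand-in pipeline (assumption): blank English model, rule-based sentence
# splitter, and an entity ruler that tags the token "will" as WILL.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "WILL", "pattern": "will"}])

doc = nlp("The language will be in english")

# With a single sentence boundary at the start of the text,
# ent.sent covers the whole sentence.
sentences = [sent.text for sent in doc.sents]
ent = doc.ents[0]
whole = ent.sent.text
print(sentences)
print(whole, ent.start_char, ent.end_char, ent.label_)

# If every token is marked as a sentence start, ent.sent shrinks
# to the entity tokens themselves.
for token in doc:
    token.is_sent_start = True
shrunk = doc.ents[0].sent.text
print(shrunk)
```

Running the same doc.sents loop on a Doc produced by the trained pipeline would show directly whether its sentence boundaries are the ones expected.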

Training data

text	start	end	label
I will do the formalities	2	6	WILL
You should send the letter	4	10	SHOULD
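
Because each row stores character offsets, a quick sanity check is that slicing the text with start and end gives back the annotated word. The sketch below is a hypothetical helper (check_offsets is not part of the original code). It assumes a comma-separated file, matching the default of csv.DictReader used in the data preparation code, and uses two illustrative rows in the same column layout.

```python
import csv
import io

# Illustrative rows in the same column layout as the training file
# (assumed comma-separated, matching csv.DictReader's default).
SAMPLE_CSV = """text,start,end,label
I will do the formalities,2,6,WILL
You should send the letter,4,10,SHOULD
"""


def check_offsets(csv_text):
    """Return (text, span_text, label) for every row, slicing by the offsets."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end = int(row["start"]), int(row["end"])
        rows.append((row["text"], row["text"][start:end], row["label"]))
    return rows


for text, span_text, label in check_offsets(SAMPLE_CSV):
    # The sliced text should be the modal verb that the label names
    print(f"{span_text!r} -> {label}")
```

A row whose slice is not the intended modal verb points to an off-by-one or encoding problem in the offsets before any training is attempted.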
