ent.sent.text in spaCy returns the label instead of the sentence in an NER problem


I am trying to solve a named entity recognition (NER) problem on PDF files using spaCy. I want to extract the modal verbs (will, shall, should, must, etc.) from the PDF files.

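I trained a model in spaCy on my data. When predicting with the trained model, the entity's ent.sent.text attribute normally returns the text of the sentence from which the entity was extracted. In my case, however, it returns the labelled word itself rather than the sentence. Can anyone help?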

The code is as follows:

Data preparation code

import csv

import spacy
from spacy.training import offsets_to_biluo_tags


def load_training_data_from_csv(file_path):
    # Only the tokenizer is used here, to check entity alignment
    nlp = spacy.load('en_core_web_md')
    train_data = []
    with open(file_path, 'r', encoding='cp1252') as f:
        reader = csv.DictReader(f)
        for row in reader:
            sentence = row['text']
            start, end = int(row['start']), int(row['end'])
            label = row['label']
            train_data.append((sentence, {"entities": [(start, end, label)]}))
            # Check that the character offsets align with token boundaries
            doc = nlp.make_doc(sentence)
            tags = offsets_to_biluo_tags(doc, [(start, end, label)])
            if '-' in tags:
                print(f"Warning: Misaligned entities in '{sentence}' with entities {[(start, end, label)]}")
    return train_data
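The alignment check above relies on spaCy's tokenizer. As a rough, library-free sanity check of the same CSV layout (columns text, start, end, label), the sketch below slices each sentence with its offsets and flags spans that cut through a word. The helper name check_offsets and the sample rows are hypothetical, and the word-boundary rule is a simplification, not spaCy's tokenization.

```python
import csv
import io

# Hypothetical sample in the same column layout as the training CSV
SAMPLE_CSV = """text,start,end,label
The language will be in english,13,17,WILL
You shall pay,4,8,SHALL
"""


def check_offsets(rows):
    """Return (surface, ok) per row; ok means the span looks word-aligned."""
    results = []
    for row in rows:
        text = row["text"]
        start, end = int(row["start"]), int(row["end"])
        in_range = 0 <= start < end <= len(text)
        surface = text[start:end] if in_range else ""
        # Simplified rule: no edge whitespace, and the characters just
        # outside the span must not be letters or digits
        left_ok = start == 0 or not text[start - 1].isalnum()
        right_ok = end >= len(text) or not text[end].isalnum()
        ok = in_range and surface == surface.strip() and left_ok and right_ok
        results.append((surface, ok))
    return results


results = check_offsets(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for surface, ok in results:
    print(repr(surface), "aligned" if ok else "MISALIGNED")
```

The second sample row is deliberately off by one (it captures "shal" instead of "shall") to show what a flagged row looks like.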

Training the model

import random

import spacy
from spacy.training import Example


def train_spacy_ner(train_data, n_iter=10):
    # Load the existing model
    nlp = spacy.load('en_core_web_md')

    # Add the NER component if it doesn't exist (spaCy v3 takes the factory name)
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")

    # Add the new modal-verb labels to the NER model
    ner.add_label("WILL")
    ner.add_label("SHALL")
    ner.add_label("MUST")

    # Train the NER model
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        # Log every second epoch and the final one
        log_epoch = i % 2 == 0 or i == n_iter - 1
        if log_epoch:
            print("Epoch - ", i)
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        if log_epoch:
            print("loss : ", losses)

    return nlp
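One possible explanation, offered as an assumption rather than something the post confirms: in spaCy v3, nlp.begin_training() is a deprecated alias of nlp.initialize(), which re-initializes the weights of every component, including the parser that supplies sentence boundaries. A minimal sketch of an alternative loop that keeps the existing weights with nlp.resume_training() and updates only "ner" via nlp.select_pipes() is shown below. The function name fine_tune_ner is hypothetical, and the small blank pipeline with a sentencizer only stands in for en_core_web_md so the sketch runs without a model download.

```python
import random

import spacy
from spacy.training import Example


def fine_tune_ner(nlp, train_data, n_iter=10):
    """Update only the "ner" component of an already-initialized pipeline."""
    ner = nlp.get_pipe("ner")
    for _, annotations in train_data:
        for _, _, label in annotations["entities"]:
            ner.add_label(label)

    # resume_training() returns an optimizer without resetting weights
    optimizer = nlp.resume_training()
    data = list(train_data)  # avoid shuffling the caller's list
    # Leave every other component (parser, tagger, ...) untouched
    with nlp.select_pipes(enable=["ner"]):
        for _ in range(n_iter):
            random.shuffle(data)
            losses = {}
            for text, annotations in data:
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update([example], sgd=optimizer, losses=losses)
    return nlp


# Stand-in pipeline (assumption): a blank English pipeline with a
# rule-based sentencizer instead of the parser in en_core_web_md
demo_nlp = spacy.blank("en")
demo_nlp.add_pipe("sentencizer")
demo_ner = demo_nlp.add_pipe("ner")
demo_ner.add_label("WILL")
demo_nlp.initialize()

demo_data = [("The language will be in english", {"entities": [(13, 17, "WILL")]})]
demo_nlp = fine_tune_ner(demo_nlp, demo_data, n_iter=2)
demo_doc = demo_nlp("The language will be in english")
print([sent.text for sent in demo_doc.sents])
```

With a real pretrained pipeline, the same function would be called on the object returned by spacy.load('en_core_web_md'); whether this resolves the behaviour described in the question is untested here.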

Calling the functions

# nlp = spacy.load("en_core_web_md")
file_path = "/content/trainData.csv"
TRAIN_DATA = load_training_data_from_csv(file_path)

# Train the model
nlp = train_spacy_ner(TRAIN_DATA)
nlp.to_disk('custom_NER')

Predicting with the model (this is where the problem starts)

import spacy

nlp = spacy.load('custom_NER')
text = "The language will be in english"

doc = nlp(text)
# print(doc.ents)
for ent in doc.ents:
  print(ent.sent.text, ent.start_char, ent.end_char, ent.label_)

ent.sent.text should return the full sentence used above. Instead, only the labelled word itself is returned.

Actual output

will 13 17 WILL

Expected output

The language will be in english 13 17 WILL
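To illustrate how both outputs can arise from the same entity, the sketch below builds the example sentence by hand and sets sentence starts explicitly through the sent_starts argument of Doc. It assumes nothing about the trained model; it only shows that ent.sent follows whatever sentence boundaries the Doc carries, so a pipeline that marks every token as a sentence start would reproduce the actual output above. The helper name entity_sentence is hypothetical.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

WORDS = ["The", "language", "will", "be", "in", "english"]
SPACES = [True, True, True, True, True, False]


def entity_sentence(sent_starts):
    """Build the Doc with the given sentence starts and inspect the 'will' entity."""
    doc = Doc(Vocab(), words=WORDS, spaces=SPACES, sent_starts=list(sent_starts))
    # Mark the token "will" (index 2) as an entity labelled WILL
    doc.ents = [Span(doc, 2, 3, label="WILL")]
    ent = doc.ents[0]
    return ent.sent.text, ent.start_char, ent.end_char, ent.label_


# One sentence spanning the whole text
one_sentence = entity_sentence([True, False, False, False, False, False])
# Degenerate boundaries: every token starts its own sentence
every_token = entity_sentence([True] * len(WORDS))

print(*one_sentence)
print(*every_token)
```

A quick check on the real pipeline along the same lines would be to print the sentences of the Doc (for example, the texts in doc.sents) and compare them with the input.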

Training data

text                         start  end  label
I will do the formalities    2      6
You should send the letter   4      10   SHOULD
python machine-learning nlp spacy named-entity-recognition