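I am trying to solve a named entity recognition (NER) problem with spaCy on PDF files. I want to extract modal verbs (will, shall, should, must, etc.) from the PDFs.
I trained a model on my data in spaCy. When I predict with the trained model, the entity's ent.sent.text attribute would normally return the sentence from which the entity was extracted. In my case, however, it returns the entity text itself instead of the sentence. Can anyone help?
The code is as follows: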
import csv
import random

import spacy
from spacy.training import offsets_to_biluo_tags


def load_training_data_from_csv(file_path):
    # The pipeline is only used here to tokenize sentences for the alignment check
    nlp = spacy.load('en_core_web_md')
    train_data = []
    with open(file_path, 'r', encoding='cp1252') as f:
        reader = csv.DictReader(f)
        for row in reader:
            sentence = row['text']
            start, end = int(row['start']), int(row['end'])
            label = row['label']
            train_data.append((sentence, {"entities": [(start, end, label)]}))
            # Check that the character offsets line up with token boundaries
            doc = nlp.make_doc(sentence)
            tags = offsets_to_biluo_tags(doc, [(start, end, label)])
            if '-' in tags:
                print(f"Warning: Misaligned entities in '{sentence}' with entities {[(start, end, label)]}")
    return train_data
def train_spacy_ner(train_data, n_iter=10):
    # Load the existing model
    nlp = spacy.load('en_core_web_md')
    # Add the NER component if it doesn't exist (spaCy v3 takes the component name)
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")
    # Add the new modal-verb labels to the NER model
    ner.add_label("WILL")
    ner.add_label("SHALL")
    ner.add_label("MUST")
    # Train the NER model
    # Note: in spaCy v3, begin_training() is a deprecated alias of nlp.initialize()
    optimizer = nlp.begin_training()
    for i in range(n_iter):
        # Log every second epoch and the final one
        verbose = i % 2 == 0 or i == n_iter - 1
        if verbose:
            print("Epoch - ", i)
        random.shuffle(train_data)
        losses = {}
        for text, annotations in train_data:
            doc = nlp.make_doc(text)
            example = spacy.training.Example.from_dict(doc, annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
        if verbose:
            print("loss : ", losses)
    return nlp
# nlp = spacy.load("en_core_web_md")
file_path = "/content/trainData.csv"
TRAIN_DATA = load_training_data_from_csv(file_path)
# Train the model and save it to disk
nlp = train_spacy_ner(TRAIN_DATA)
nlp.to_disk('custom_NER')

# Load the trained model and run a prediction
import spacy
nlp = spacy.load('custom_NER')
text = "The language will be in english"
doc = nlp(text)
# print(doc.ents)
for ent in doc.ents:
    print(ent.sent.text, ent.start_char, ent.end_char, ent.label_)
ent.sent.text should return the sentence used above, but here the entity text itself is returned.
Actual output:
will 13 17 WILL
Expected output:
The language will be in english 13 17 WILL
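For reference, here is a minimal sketch of the behaviour I expect from ent.sent. It is not my trained model: as an assumption for illustration, it uses a blank English pipeline with the rule-based sentencizer, and the entity is set by hand rather than predicted. The second sentence in the text is made up. As far as I understand, ent.sent relies on the sentence boundaries stored in the doc, which some component such as the parser or the sentencizer has to set.

```python
import spacy
from spacy.tokens import Span

# Assumption: a blank pipeline plus the sentencizer stands in for the trained
# model, so the sketch runs without downloading or training anything.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("The language will be in english. Payment must follow.")

# Mark "will" (token index 2) as an entity by hand instead of predicting it.
doc.ents = [Span(doc, 2, 3, label="WILL")]

for ent in doc.ents:
    # ent.text is the entity itself; ent.sent.text is the enclosing sentence.
    print(ent.text, "|", ent.sent.text, ent.start_char, ent.end_char, ent.label_)
```

Here ent.text and ent.sent.text differ, which is what I expected from my own model as well.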
Text | Start | End | Label |
---|---|---|---|
I will do the paperwork | 2 | 6 | WILL |
You should send the letter | 4 | 10 | SHOULD |
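The start and end values in the training data are character offsets into the text, with the end exclusive. The sketch below shows how a single row can be checked in isolation, using the same offsets_to_biluo_tags helper as in the loading function. It assumes a blank English tokenizer instead of en_core_web_md, and the "bad" row is a hypothetical example of a misaligned annotation, not one taken from my data.

```python
import warnings

import spacy
from spacy.training import offsets_to_biluo_tags

# Assumption: only the tokenizer is needed, so a blank pipeline is enough.
nlp = spacy.blank("en")
sentence = "The language will be in english"
doc = nlp.make_doc(sentence)

# A well-formed row: characters 13..17 cover exactly the token "will".
good = (13, 17, "WILL")
print(sentence[good[0]:good[1]])
good_tags = offsets_to_biluo_tags(doc, [good])

# A hypothetical bad row: the span stops one character before the token ends.
bad = (13, 16, "WILL")
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # spaCy warns about the misalignment
    bad_tags = offsets_to_biluo_tags(doc, [bad])

print(good_tags)
print(bad_tags)
```

A '-' in the returned tags marks a token that overlaps the annotated span without matching its boundaries, which is the condition the loading function warns about.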