I'm running a SpaCy Matcher line by line over a text file. Each text entry in my file is on a separate line. I'm trying to extract 1) the matched instance, 2) the full sentence, and 3) the previous sentence. I can get the first two, but I can't get the previous sentence because there is no sentence index (per this post). Here is my code:
with open('file.txt', 'r') as f:
    for line in iter(f.readline, ''):
        doc = nlp(line)
        matcher = Matcher(nlp.vocab)
        matcher.add("pattern_of_interest", [pattern])
        matches = matcher(doc)
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]
            span = doc[start:end]
            for sent in doc.sents:
                if matcher(sent):
                    instances.append(pd.Series({"instance": str(span.text),
                                                "sentence": str(sent.text),
                                                "previous_sentence": str(sent[-1].text)}))
I know the highlighted part (`sent[-1].text`) gives me the previous token, not the previous sentence (I tried to work around this with a list, but it didn't work). Any suggestions for retrieving the previous sentence would be much appreciated. Thanks!
Adjustments:

- Tracking the previous sentence: we now maintain a `prev_sent` variable that keeps track of the previous sentence while iterating over all sentences in the document.
- Matcher usage: we only need to create the Matcher instance once, outside the loop over the lines in the file, and then apply it to each sentence inside the loop. This is more efficient than recreating it for every line.
- Checking for a previous sentence: we handle the case where there may be no previous sentence (e.g. a match found in the first sentence of the document) by checking whether `prev_sent` is `None`. If it is, we set the "previous_sentence" field to "N/A" or whatever placeholder text you see fit.
import spacy
from spacy.matcher import Matcher
import pandas as pd

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")  # Adjust model as necessary

# Define your pattern here
pattern = [{"LOWER": "example"}]  # Example pattern

# Initialize the matcher once, outside the loop, with the vocab
matcher = Matcher(nlp.vocab)
matcher.add("pattern_of_interest", [pattern])

instances = []  # List to hold match details

with open('file.txt', 'r') as f:
    for line in iter(f.readline, ''):
        doc = nlp(line)
        prev_sent = None  # Variable to keep track of the previous sentence
        for sent in doc.sents:
            matches = matcher(sent)  # Matching a Span: indices are relative to the span
            for match_id, start, end in matches:
                instance_text = sent[start:end].text  # The matched instance
                current_sentence = sent.text
                previous_sentence = prev_sent.text if prev_sent else "N/A"  # Handle the case where there's no previous sentence
                # Append the extracted information to your instances list
                instances.append(pd.Series({"instance": instance_text,
                                            "sentence": current_sentence,
                                            "previous_sentence": previous_sentence}))
            prev_sent = sent  # Update the previous sentence for the next iteration

# Convert instances list to DataFrame
df_instances = pd.DataFrame(instances)
print(df_instances)
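As a quick sanity check, here is a minimal sketch of the same previous-sentence tracking on an in-memory string. It uses a blank English pipeline with the rule-based `sentencizer` so no model download is needed; the text and pattern are made up for illustration:

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline with the rule-based sentencizer (no trained model needed)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
matcher.add("pattern_of_interest", [[{"LOWER": "example"}]])

doc = nlp("This is the first sentence. This one contains an example. Nothing here.")

rows = []
prev_sent = None
for sent in doc.sents:
    # Calling the matcher on a Span returns indices relative to that span
    for match_id, start, end in matcher(sent):
        rows.append({"instance": sent[start:end].text,
                     "sentence": sent.text,
                     "previous_sentence": prev_sent.text if prev_sent else "N/A"})
    prev_sent = sent

print(rows)
```

The match lands in the second sentence, so `previous_sentence` comes out as "This is the first sentence." while a match in the first sentence would have produced "N/A".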