Using the SpaCy Matcher to get the previous sentence


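I am running the SpaCy Matcher over a text file line by line. Each text entry in my file sits on its own line. I am trying to extract 1) the matched instance, 2) the full sentence containing it, and 3) the previous sentence. I can get the first two, but I cannot get the previous sentence because sentences have no index (according to this post). Here is my code: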

with open('file.txt', 'r') as f:
    for line in iter(f.readline, ''):
        doc = nlp(line)
        matcher = Matcher(nlp.vocab)
        matcher.add("pattern_of_interest", [pattern])
        matches = matcher(doc)
    
        for match_id, start, end in matches:
            string_id = nlp.vocab.strings[match_id]
            span = doc[start:end] 
            
        for sent in doc.sents:
            if matcher(sent):
                instances.append(pd.Series({"instance":str(span.text), 
                                        "sentence":str(sent.text),
                                        "previous_sentence":str(sent[-1].text)}))

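I know that the `str(sent[-1].text)` part gives me a token, not the previous sentence (I tried to work around this with a list, but it did not work). Any suggestions for retrieving the previous sentence would be much appreciated. Thanks!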

python nlp spacy
1 Answer

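Adjustments:

Tracking the previous sentence: we now maintain a prev_sent variable that keeps track of the previous sentence while iterating over all sentences in the document.

Matcher usage: we only need to create the Matcher instance once, outside the loop over the file's lines, and then apply it to each sentence inside the loop. This is more efficient than recreating it for every line.

Checking for a previous sentence: we handle the case where there is no previous sentence (for example, when the match is found in the first sentence of the document) by checking whether prev_sent is None. If it is, we set the "previous_sentence" field to "N/A" or any placeholder text you see fit.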

import spacy
from spacy.matcher import Matcher
import pandas as pd

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")  # Adjust model as necessary

# Define your pattern here
pattern = [{"LOWER": "example"}]  # Example pattern

# Initialize matcher with the vocab
matcher = Matcher(nlp.vocab)
matcher.add("pattern_of_interest", [pattern])

instances = []  # List to hold match details

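with open('file.txt', 'r') as f:
    for line in f:  # Iterate over the file one line (one text entry) at a time
        doc = nlp(line.strip())  # Strip the trailing newline so it does not end up in the last sentence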
        
        prev_sent = None  # Variable to keep track of the previous sentence
        for sent in doc.sents:
            matches = matcher(sent)
            if matches:
                for match_id, start, end in matches:
                    instance_text = sent[start:end].text  # The matched instance
                    current_sentence = sent.text
                    previous_sentence = prev_sent.text if prev_sent else "N/A"  # Handle the case where there's no previous sentence
                    
                    # Append the extracted information to your instances list
                    instances.append(pd.Series({"instance": instance_text, 
                                                 "sentence": current_sentence,
                                                 "previous_sentence": previous_sentence}))
            prev_sent = sent  # Update the previous sentence for the next iteration

# Convert instances list to DataFrame
df_instances = pd.DataFrame(instances)
print(df_instances)
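
As an alternative to tracking prev_sent, here is a minimal sketch that runs the matcher once on the whole doc, finds each match's sentence through `Span.sent`, and looks up the previous sentence by its position in `list(doc.sents)`. It assumes spaCy v3. The blank English pipeline with the rule-based `sentencizer` is my assumption so the sketch runs without downloading a model; you could swap in `en_core_web_sm` as above. The helper name `matches_with_previous` is hypothetical.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline plus the rule-based sentencizer,
# used only so this sketch runs without a downloaded model.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
matcher.add("pattern_of_interest", [[{"LOWER": "example"}]])  # Example pattern


def matches_with_previous(text):
    """Return (instance, sentence, previous_sentence) tuples for one text entry."""
    doc = nlp(text)
    sents = list(doc.sents)
    # Map each sentence's first token index to its position in the list,
    # which gives the sentence index that doc.sents itself does not expose.
    index_by_start = {sent.start: i for i, sent in enumerate(sents)}
    rows = []
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        i = index_by_start[span.sent.start]
        previous = sents[i - 1].text if i > 0 else "N/A"
        rows.append((span.text, sents[i].text, previous))
    return rows


rows = matches_with_previous("This is the first sentence. Here is an example.")
print(rows)
```

Because the matcher runs on the doc rather than on each sentence, the start and end offsets index directly into doc, and the sentence list makes it easy to reach further back (for example `sents[i - 2]`) if more context is needed.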