Reading the lines between a start and an end marker in Python 3 using rule-based processing

Question  Votes: 0  Answers: 1

I have long lines of text and want to read the text between two words. It is a large text file in a standard format, as shown below:

First Paragraph
(Empty line)
Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words
(Empty line)    
Second Paragraph
(Empty line)

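I am looking for the text between the First Paragraph and Second Paragraph lines. I tried several approaches using spaCy but could not get what I want.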

#Approach 1:  This approach doesn't return anything

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'first'}, {'LOWER': 'second'}]

matcher.add("FindParagrpah", None, pattern)

doc = nlp(random_text_from_file_in_String_format)

matches = matcher(doc)
for match_id, start, end in matches:
  string_id = nlp.vocab.strings[match_id]  # Get string representation
  span = doc[start:end]  # The matched span
  print(match_id, string_id, start, end, span.text)

#Approach 2: This returns the whole text instead of expected text between First and Second words.

import spacy

nlp = spacy.load("en_core_web_sm")

def custom_boundary(docx):
    for token in docx[:-1]:
        if token.text == 'Second':
            docx[token.i+1].is_sent_start = True
    return docx

nlp.add_pipe(custom_boundary,before='parser')

mysentence= nlp(text)

for sentence in mysentence.sents:
 print(sentence)

What am I doing wrong? Should I use a different library? Any help would be appreciated. Thanks.

python python-3.x nltk spacy
1 Answer
0 votes

If the text data contains empty lines, you could perhaps use a regular expression:

import re

DATA = """First Paragraph

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words

Second Paragraph
"""

paragraph = re.split(r"\n\n", DATA)
print(paragraph[1])

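paragraph will be a list containing 3 elements.

Printing the second element, paragraph[1], gives:

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words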
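
Splitting on blank lines only works while the wanted block is exactly the second chunk. If the file can contain extra blank lines, anchoring on the marker lines themselves may be more robust. As a side note, the Matcher pattern in Approach 1 describes the token "first" immediately followed by the token "second", which never occurs in the text, so it finds nothing. Below is a minimal sketch using only the standard library; the helper names text_between and lines_between and the SAMPLE string are made up for illustration, and it assumes each marker sits on a line of its own.

```python
import re

# Sample text mirroring the layout described in the question.
SAMPLE = """First Paragraph

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words

Second Paragraph
"""


def text_between(text, start="First Paragraph", end="Second Paragraph"):
    """Return the text between two marker lines, or None if a marker is missing."""
    pattern = re.compile(
        # Each marker must occupy a whole line; (.*?) lazily captures what lies between.
        rf"^{re.escape(start)}[ \t]*\n(.*?)^{re.escape(end)}[ \t]*$",
        flags=re.DOTALL | re.MULTILINE,
    )
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def lines_between(lines, start="First Paragraph", end="Second Paragraph"):
    """Yield the non-empty lines between the markers, one line at a time."""
    inside = False
    for line in lines:
        stripped = line.strip()
        if not inside:
            # Keep skipping lines until the start marker shows up.
            inside = stripped == start
        elif stripped == end:
            break
        elif stripped:
            yield stripped


if __name__ == "__main__":
    print(text_between(SAMPLE))
```

For a large file, lines_between can take the open file object directly (for example `with open(path) as fh: for line in lines_between(fh): ...`, where path is a placeholder), so the whole file never has to be held in memory.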
