I have text with very long lines and want to read the text between two words. It is a large text file in a standard format, like this:
First Paragraph
(Empty line)
Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words
(Empty line)
Second Paragraph
(Empty line)
I am looking for the text between First Paragraph and Second Paragraph. I tried a few approaches using spaCy, but could not get what I want.
#Approach 1: This approach doesn't return anything
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'first'}, {'LOWER': 'second'}]
matcher.add("FindParagrpah", None, pattern)
doc = nlp(random_text_from_file_in_String_format)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id] # Get string representation
    span = doc[start:end] # The matched span
    print(match_id, string_id, start, end, span.text)
#Approach 2: This returns the whole text instead of expected text between First and Second words.
import spacy
nlp = spacy.load("en_core_web_sm")
def custom_boundary(docx):
    for token in docx[:-1]:
        if token.text == 'Second':
            docx[token.i+1].is_sent_start=True
    return docx
nlp.add_pipe(custom_boundary,before='parser')
mysentence= nlp(text)
for sentence in mysentence.sents:
    print(sentence)
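What am I doing wrong? Should I use a different library? Any help would be appreciated. Thanks.
If the text data has empty lines between the paragraphs, you could perhaps use a regular expression: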
import re
DATA = """First Paragraph

Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words

Second Paragraph
"""
paragraph = re.split(r"\n\n", DATA)
print(paragraph[1])
paragraph will be a list containing 3 elements;
printing the second element, paragraph[1], gives:
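Random lines are in this file with different words. Some
random lines are in this file with some different words. Some words here
random lines are in this file with various different words. Many words
random lines are in this file with plenty of different words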
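If the file does not reliably have empty lines around the headings, a variation is to match on the two marker strings directly. The following is only a minimal sketch: the helper name text_between and the shortened sample text are made up for illustration, and it assumes each marker appears literally in the text and that only the first occurrence is wanted.

```python
import re

def text_between(text, start_marker, end_marker):
    """Return the text between the first start_marker and the next end_marker.

    Hypothetical helper for illustration; returns None if either marker is missing.
    """
    # re.escape treats the markers as literal text; (.*?) is non-greedy so the
    # match stops at the first end_marker; re.DOTALL lets '.' span newlines.
    pattern = re.escape(start_marker) + r"(.*?)" + re.escape(end_marker)
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None

# Shortened sample in the same shape as the file from the question.
sample = (
    "First Paragraph\n"
    "\n"
    "Random lines are in this file with different words. Some\n"
    "random lines are in this file with plenty of different words\n"
    "\n"
    "Second Paragraph\n"
)

body = text_between(sample, "First Paragraph", "Second Paragraph")
print(body)
```

Because the markers are plain strings here, this works with or without blank lines around them, but it will also match the marker words if they happen to occur inside a paragraph, so pick markers that are unique in the file.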