Text segmentation based on punctuation, especially at the clause level

Question

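I want to segment text whenever punctuation is encountered in a sentence or paragraph. If I include the comma (,) in my regular expression, it also splits apart individual nouns, verbs, or adjectives that are separated by commas. Take "dogs, cats, rats and other animals": "dogs" becomes a separate chunk, which I do not want. Is there any way, with a regular expression or some other means in nltk, to ignore those list commas so that I only get comma-separated clauses as text segments?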

Code

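from nltk import sent_tokenize  # imported in the original snippet; not used below
import re

text = ("Peter Mattei's 'Love in the Time of Money' is a visually stunning film to watch. "
        "Mrs. Mattei offers us a vivid portrait about human relations. "
        "This is a movie that seems to be telling us what money, power and success "
        "do to people in the different situation we encounter.")
# Protect the period after a title (Dr, Mrs, Mr, Ms, Prof) so it is not treated as a sentence end.
text = re.sub("(?<=..Dr|.Mrs|..Mr|..Ms|Prof)[.]", "<prd>", text)
# Split on sentence-final periods, semicolons, colons, question marks, exclamation marks, and quote boundaries.
txt = re.split(r'\.\s|;|:|\?|\'\s|"\s|!|\s\'|\s\"', text)
print(txt)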

Answer 1

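The problem you are trying to solve is known in NLP as chunking. Traditionally, it is done with a regular-expression-based algorithm that operates on POS tags (so you need to do POS tagging first). NLTK has a tutorial for that, but it is an outdated approach.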
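To make the traditional approach concrete, here is a minimal sketch of regex chunking over POS tags with NLTK's RegexpParser. The tokens are tagged by hand (an assumption, so no tagger model has to be downloaded), and the grammar is only an illustration: it treats a comma- or conjunction-joined run of noun groups as one NP chunk, so the list in the question is not broken apart.

```python
from nltk import RegexpParser

# Hand-tagged tokens (assumed Penn Treebank tags); in practice these would
# come from a POS tagger.
tagged = [
    ("dogs", "NNS"), (",", ","), ("cats", "NNS"), (",", ","),
    ("mice", "NNS"), ("and", "CC"), ("other", "JJ"), ("animals", "NNS"),
    ("live", "VBP"), ("here", "RB"),
]

# Illustrative grammar: optional adjectives plus nouns, optionally followed by
# more such groups joined by commas or coordinating conjunctions.
grammar = r"NP: {<JJ>*<NN.*>+(<,|CC>+<JJ>*<NN.*>+)*}"

tree = RegexpParser(grammar).parse(tagged)

# Collect the words of every NP chunk the grammar found.
chunks = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(chunks)
```

Because the commas inside the enumeration are absorbed into the chunk, only commas left outside any chunk remain as candidates for clause boundaries.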

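Now that fast and reliable taggers and parsers are available (for example, in spaCy), I would suggest parsing the sentence first and then looking for the chunks in the constituency parse.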
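As a rough sketch of that idea, the snippet below splits a sentence only at commas that sit directly under the clause node of a constituency tree, leaving commas inside a noun phrase alone. The bracketed parse is written by hand for illustration; in practice it would come from whichever constituency parser you use (note that spaCy's built-in parser produces dependency parses, so a constituency parse needs a separate component). The function name clause_segments is hypothetical.

```python
from nltk import Tree

# Hand-written Penn-style parse (an assumption standing in for parser output).
parse = Tree.fromstring("""
(S
  (SBAR (WHADVP (WRB When)) (S (NP (DT the) (NN rain)) (VP (VBD stopped))))
  (, ,)
  (NP (NNS dogs) (, ,) (NNS cats) (CC and) (NNS mice))
  (VP (VBD ran) (ADVP (RB outside)))
  (. .))
""")

def clause_segments(tree):
    """Split at commas that are direct children of `tree`; deeper commas stay put."""
    segments, current = [], []
    for child in tree:
        if isinstance(child, Tree) and child.label() == ",":
            # Clause-level comma: close the current segment.
            if current:
                segments.append(" ".join(current))
                current = []
        elif isinstance(child, Tree):
            current.extend(child.leaves())
        else:
            current.append(child)
    if current:
        segments.append(" ".join(current))
    return segments

segments = clause_segments(parse)
print(segments)
```

This only looks at the top-level clause; a fuller version would recurse into embedded S and SBAR nodes, which is left out here to keep the sketch short.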
