Text preprocessing in Python

Question  Votes: -1  Answers: 1

I have the text input = 'The quick brown fox. Jumped over the lazy dog.' and I would like the output to look like this:

[['quick','brown','fox','.'],['jumped','lazy','dog','.']]

Please let me know how to do this.

So far I have only split the sentence into words, and I am not sure what to do next.

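import nltk
from nltk.tokenize import word_tokenize

# Named text rather than input so the built-in input() is not shadowed
text = "The quick brown fox. Jumped over the lazy dog."
tokens = word_tokenize(text)  # Requires NLTK's Punkt tokenizer data to be downloaded
print(tokens)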
python nltk text-processing
1 Answer
0
votes

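There are many ways to solve this, but let's stick with the approach you have so far.

So you split the sentence into words. I assume you did that with text = text.split(" "), so the list looks like text = ["The", "quick", "brown", "fox.", "Jumped", "over", "the", "lazy", "dog."]

Now let's separate the periods from the words and collect everything in a new list, new_list.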

text = "The quick brown fox. Jumped over the lazy dog."
text = text.split(" ")  # Split on spaces; periods stay attached, e.g. "fox."
new_list = []  # New list we will write the words to

for word in text:
    if '.' in word:
        word = word.split('.')  # Here we assume the period always comes after the word
        new_list.append(word[0])
        new_list.append('.')
    else:
        new_list.append(word)
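
The loop above relies on the period always being the last character of a word. If that assumption is too fragile, one possible alternative (my own sketch, not something from the question) is a regular expression that pulls out words and punctuation marks as separate tokens in one pass:

```python
import re

text = "The quick brown fox. Jumped over the lazy dog."

# \w+ matches a run of word characters; [^\w\s] matches a single
# character that is neither a word character nor whitespace (e.g. ".")
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['The', 'quick', 'brown', 'fox', '.', 'Jumped', 'over', 'the', 'lazy', 'dog', '.']
```

This also splits off commas, question marks and similar characters, which may or may not be what you want.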

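Now, it looks like you don't want words such as "The" or "over". To drop them, just create another list, for example skip_words = ["The", "the", "over"], and filter new_list against it.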

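skip_words = ["The", "the", "over"]
# Keep only the words that are not in skip_words. A list comprehension removes
# every occurrence and does not raise ValueError when a word is absent,
# unlike new_list.remove(word).
new_list = [word for word in new_list if word not in skip_words]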

That should do the trick! Now just try printing new_list.
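
Note that new_list is a flat list and keeps the original capitalization, while the output you asked for is one lowercased list per sentence. Here is a minimal sketch of how the grouping could be done. The function name preprocess, the default skip words, and the rule "a period ends a sentence" are my own assumptions, so adjust them to your data:

```python
import re

def preprocess(text, skip_words=("the", "over")):
    """Return one list of lowercased tokens per sentence, keeping the period."""
    skip = set(skip_words)
    sentences = []
    current = []
    # Words and punctuation marks come out as separate tokens
    for token in re.findall(r"\w+|[^\w\s]", text):
        token = token.lower()
        if token in skip:
            continue
        current.append(token)
        if token == ".":  # Assumption: a period closes the current sentence
            sentences.append(current)
            current = []
    if current:  # Keep trailing words that have no closing period
        sentences.append(current)
    return sentences

result = preprocess("The quick brown fox. Jumped over the lazy dog.")
print(result)
# [['quick', 'brown', 'fox', '.'], ['jumped', 'lazy', 'dog', '.']]
```

Because the tokens are lowercased before the check, skip_words only needs the lowercase forms.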
