Text preprocessing in Python

Question

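I have the text input = 'The quick brown fox. Jumped over the lazy dog.' and I want the output to look like this:

[['quick', 'brown', 'fox', '.'], ['jumped', 'lazy', 'dog', '.']]

Please let me know how to do this.

So far I have only split the sentence into words, and I am not sure what to do next.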

import nltk
from nltk.tokenize import word_tokenize

# Named "text" rather than "input" to avoid shadowing the built-in input()
text = "The quick brown fox. Jumped over the lazy dog."
tokens = word_tokenize(text)
print(tokens)

Answer 1

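There are many ways to solve this, but let's stick with the approach you have taken so far.

So you split the sentence into words. I assume you did that with text = text.split(" "), so the list looks like text = ["The", "quick", "brown", "fox.", "Jumped", "over", "the", "lazy", "dog."]

Now let's make the periods separate items in a new list, new_list.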

text = "The quick brown fox. Jumped over the lazy dog."
text = text.split(" ")
new_list = []  # New list we will write the words to

for word in text:
    if word.endswith('.'):
        # Here we assume the period always comes at the end of the word
        new_list.append(word[:-1])
        new_list.append('.')
    else:
        new_list.append(word)
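As an alternative to splitting on spaces and then peeling off the period, the separation can be done in one step with a regular expression. This is only a sketch, not the asker's or NLTK's method: the pattern below is my own assumption and treats any run of word characters as a token and any other non-space character as its own token.

import re

text = "The quick brown fox. Jumped over the lazy dog."

# \w+ matches a whole word; [^\w\s] matches a single punctuation character
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)

Unlike the loop above, this does not assume the period is the last character of a word, so it also separates commas, question marks and similar characters.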

Now, it looks like you don't want words such as "The" or "over". For that, just create another list, for example skip_words = ["The", "the", "over"].

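skip_words = ["The", "the", "over"]
# Keep only the words that are not in skip_words. Unlike list.remove(), this
# drops every occurrence and does not fail when a word is missing.
new_list = [word for word in new_list if word not in skip_words]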

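That should do the trick! Now just try printing new_list. Note that new_list is still a flat list and "Jumped" keeps its capital letter, so to match the output in the question you would still need to lowercase the words and start a new sub-list after each period.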
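Putting the steps together, here is a minimal sketch that produces the nested list from the question. The function name preprocess, the default skip_words and the regular expression are my own assumptions for illustration. It also assumes that every sentence ends with a period.

```python
import re

def preprocess(text, skip_words=("the", "over")):
    """Return a list of sentences, each a list of lowercase tokens."""
    sentences = []
    current = []
    # Lowercase first, then split into words and single punctuation marks
    for token in re.findall(r"\w+|[^\w\s]", text.lower()):
        if token in skip_words:
            continue  # drop unwanted words
        current.append(token)
        if token == ".":
            # A period closes the current sentence
            sentences.append(current)
            current = []
    if current:
        # Keep trailing words that were not followed by a period
        sentences.append(current)
    return sentences

result = preprocess("The quick brown fox. Jumped over the lazy dog.")
print(result)
# [['quick', 'brown', 'fox', '.'], ['jumped', 'lazy', 'dog', '.']]
```

Because the text is lowercased before filtering, skip_words only needs the lowercase forms.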
