How to remove stop words using nltk or python

Question

So I have a dataset from which I would like to remove stop words using

stopwords.words('english')

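I'm struggling with how to use this in my code to simply take out these words. I already have a list of the words from this dataset; the part I'm struggling with is comparing against this list and removing the stop words. Any help is appreciated.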

Answer 1

from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
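Note that the comprehension above calls stopwords.words('english') once per word, and membership tests on a list are linear. A minimal sketch of a faster variant, assuming the stop words are converted to a set once up front. The function name filter_stopwords and the tiny sample lists are made up for illustration, so the snippet runs without the NLTK corpus:

```python
def filter_stopwords(word_list, stop_words):
    """Return the words of word_list that are not stop words, in order."""
    stop_set = set(stop_words)  # build once; set lookups are O(1) on average
    return [word for word in word_list if word not in stop_set]


# Hypothetical sample data; with NLTK you would pass stopwords.words('english').
sample_stop_words = ["the", "is", "a", "of"]
sample_words = ["the", "cat", "is", "a", "friend", "of", "mine"]

filtered_words = filter_stopwords(sample_words, sample_stop_words)
print(filtered_words)
```

On a large word list this avoids rebuilding and rescanning the stop word list for every word.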

Answer 2

You could also do a set difference, for example:

list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
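One thing to keep in mind with the set difference: it drops repeated words and does not keep the original word order. A small sketch of that behaviour, using a made-up sentence and a made-up stop list rather than the NLTK corpus, with a plain split() standing in for nltk.regexp_tokenize:

```python
sentence = "the cat saw the other cat"
stop_words = {"the"}  # hypothetical stop list, for illustration only

tokens = sentence.split()  # stand-in for a real tokenizer

# Set difference: unique non-stop words, in no particular order.
unique_words = set(tokens) - stop_words

# List comprehension: keeps the order and the repeated word.
ordered_words = [t for t in tokens if t not in stop_words]

print(sorted(unique_words))
print(ordered_words)
```

If word counts or word order matter later on (for example for frequency statistics), the list comprehension is probably the safer choice.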

Answer 3

I suppose you have a list of words (word_list) from which you want to remove stop words. You could do something like this:

from nltk.corpus import stopwords

filtered_word_list = word_list[:]  # make a copy of word_list
for word in word_list:  # iterate over word_list
    if word in stopwords.words('english'):
        filtered_word_list.remove(word)  # remove word from the copy if it is a stop word

Answer 4

To exclude all types of stop words, including the nltk stop words, you could do something like this:

from stop_words import get_stop_words
from nltk.corpus import stopwords

stop_words = list(get_stop_words('en'))         #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)

output = [w for w in word_list if w not in stop_words]
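Since the two sources overlap, extend leaves duplicate entries in stop_words. A sketch of an alternative, assuming a set union is acceptable here. The two short lists are hypothetical stand-ins for get_stop_words('en') and stopwords.words('english'), so the snippet runs without either package:

```python
list_a = ["a", "the", "of", "about"]  # stand-in for get_stop_words('en')
list_b = ["the", "of", "is"]          # stand-in for stopwords.words('english')

# The union keeps each stop word once, whichever source it came from.
combined = set(list_a) | set(list_b)

word_list = ["the", "talk", "is", "about", "parsing"]
output = [w for w in word_list if w not in combined]
print(output)
```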

Answer 5

Use the textcleaner library to remove stop words from your data.

See this link: https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds

Follow these steps to do so with this library.

pip install textcleaner

After installing:

import textcleaner as tc
data = tc.document(<file_name>) 
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default

Use the code above to remove the stop words.

Answer 6

You can use this function; note that you need to lowercase all the words.

from nltk.corpus import stopwords

def remove_stopwords(word_list):
    processed_word_list = []
    for word in word_list:
        word = word.lower()  # in case they aren't all lowercased
        if word not in stopwords.words("english"):
            processed_word_list.append(word)
    return processed_word_list
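The function above also lowercases the words it returns. If the original casing should survive, one possible variant compares the lowercased form but keeps the word unchanged. This is only a sketch: the name remove_stopwords_keep_case and the explicit stop_words parameter are assumptions made so the example runs without the NLTK corpus.

```python
def remove_stopwords_keep_case(word_list, stop_words):
    """Drop stop words case-insensitively, keeping each kept word's casing."""
    stop_set = {s.lower() for s in stop_words}
    return [word for word in word_list if word.lower() not in stop_set]


# Hypothetical sample data; with NLTK you would pass stopwords.words("english").
sample = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
result = remove_stopwords_keep_case(sample, ["the", "is", "in"])
print(result)
```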

Answer 7

Using filter:

from nltk.corpus import stopwords
# ...  
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))

Answer 8

# 1) Build a new list of the words that are not stop words
print("Enter the string from which you want to remove stop words:")
user_words = input().split(" ")
stop_list = ["a", "an", "the", "in"]
kept_words = []
for x in user_words:
    if x not in stop_list:    # compare against the stop list
        kept_words.append(x)  # keep only the words that are not in it
for x in kept_words:
    print(x, end=' ')

# 2) The same thing, if you prefer to use .remove
print("Enter the string from which you want to remove stop words:")
user_words = input().split(" ")
stop_list = ["a", "an", "the", "in"]
for x in user_words[:]:       # iterate over a copy so removing does not skip words
    if x in stop_list:
        user_words.remove(x)
for x in user_words:
    print(x, end=' ')
