所以我有一个数据集,我想删除使用的停止词
stopwords.words('english')
我正在努力如何在我的代码中使用它只是简单地取出这些单词。我已经有了这个数据集中的单词列表,我正在努力的部分是与此列表进行比较并删除停用词。任何帮助表示赞赏。
from nltk.corpus import stopwords
# ...
filtered_words = [word for word in word_list if word not in stopwords.words('english')]
你也可以做一个设置差异,例如:
list(set(nltk.regexp_tokenize(sentence, pattern, gaps=True)) - set(nltk.corpus.stopwords.words('english')))
我想你有一个单词列表(word_list),你想从中删除停用词。你可以这样做:
filtered_word_list = word_list[:] #make a copy of the word_list
for word in word_list: # iterate over word_list
if word in stopwords.words('english'):
filtered_word_list.remove(word) # remove word from filtered_word_list if it is a stopword
要排除所有类型的停用词,包括nltk停用词,你可以这样做:
from stop_words import get_stop_words
from nltk.corpus import stopwords
stop_words = list(get_stop_words('en')) #About 900 stopwords
nltk_words = list(stopwords.words('english')) #About 150 stopwords
stop_words.extend(nltk_words)
output = [w for w in word_list if not w in stop_words]
使用textcleaner库从数据中删除停用词。
点击此链接:https://yugantm.github.io/textcleaner/documentation.html#remove_stpwrds
请按照以下步骤使用此库执行此操作。
pip install textcleaner
安装后:
import textcleaner as tc
data = tc.document(<file_name>)
#you can also pass list of sentences to the document class constructor.
data.remove_stpwrds() #inplace is set to False by default
使用上面的代码删除停用词。
你可以使用这个功能,你应该注意到你需要降低所有单词
from nltk.corpus import stopwords
def remove_stopwords(word_list):
processed_word_list = []
for word in word_list:
word = word.lower() # in case they arenet all lower cased
if word not in stopwords.words("english"):
processed_word_list.append(word)
return processed_word_list
使用filter:
from nltk.corpus import stopwords
# ...
filtered_words = list(filter(lambda word: word not in stopwords.words('english'), word_list))
import sys
print ("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
list =["a","an","the","in"]
another_list = []
for x in userstring:
if x not in list: # comparing from the list and removing it
another_list.append(x) # it is also possible to use .remove
for x in another_list:
print(x,end=' ')
# 2) if you want to use .remove more preferred code
import sys
print ("enter the string from which you want to remove list of stop words")
userstring = input().split(" ")
list =["a","an","the","in"]
another_list = []
for x in userstring:
if x in list:
userstring.remove(x)
for x in userstring:
print(x,end = ' ')
#the code will be like this