How to remove stop words from the sublists of tokens in a list using NLTK stopwords

Question (votes: 0, answers: 1)

I have a list like this:

mylist = [['how', 'to', 'unlock', 'my', 'bajaj', 'finance', 'emi', 'card'], ['how', 'to', 'unlock', 'my', 'card'], ['how', 'to', 'unlock', 'my', 'card', 'tell', 'me', 'the', 'what', 'next'], ['how', 'to', 'unlock', 'my', 'emi', 'card']]

I want to remove the stop words from it. My code looks like this:

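from nltk.corpus import stopwords

stopword = stopwords.words('english')
filtered_data = []
# lemmetizeXlist is the nested list of tokens (same shape as mylist above)
for w in range(0, len(lemmetizeXlist)):
    # each lemmetizeXlist[w] is a whole sublist, so this test never matches a stop word
    if lemmetizeXlist[w] not in stopword:
        filtered_data.append(lemmetizeXlist[w])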
python list nltk stop-words
1 Answer
0 votes

Lemmatize based on part-of-speech (POS) tags.
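As a rough sketch of what POS-driven lemmatization could look like: the helper below maps Penn Treebank tags (the tag set that `nltk.pos_tag` returns) to the single-letter POS values that `WordNetLemmatizer.lemmatize` accepts, so each word is lemmatized once with its own tag. The names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical and not part of NLTK, and the mapping only covers the four WordNet categories.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # fall back to noun, which is also lemmatize()'s default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, one pass per word, using the tag-derived POS.

    `lemmatizer` is any object with a lemmatize(word, pos=...) method,
    for example nltk.stem.WordNetLemmatizer().
    """
    return [lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
            for word, tag in tagged_tokens]


# With NLTK and its tagger/WordNet data installed, usage would look like:
#   from nltk import pos_tag
#   from nltk.stem import WordNetLemmatizer
#   lemmas = lemmatize_tagged(pos_tag(tokens), WordNetLemmatizer())
```

Tagging works best on the full token sequence, so it may be preferable to tag before removing stop words and filter afterwards.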

Code:

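import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup if the NLTK data is not installed yet:
# nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')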

text = """He determined to drop his litigation with the monastry, and 
       relinguish his claims to the wood-cuting and 
       fishery rihgts at once. He was the more ready to do this becuase the 
       rights had become much less valuable, and he had 
       indeed the vaguest idea where the wood and river in question were."""

stop_words = set(stopwords.words('english')) 

word_tokens = word_tokenize(text) 

filtered_sentence = [] 

for w in word_tokens: 
    if w not in stop_words: 
        filtered_sentence.append(w) 
print(filtered_sentence) 

lemma_word = []
wordnet_lemmatizer = WordNetLemmatizer()
for w in filtered_sentence:
    # Lemmatize in three passes: as a noun, then as a verb, then as an adjective
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)

Input (the text after stop-word removal):

He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.

Output (after lemmatization):

He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase right become much le valuable, indeed vaguest idea wood river question.
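For the nested list in the question, the same filtering has to be applied to each token inside each sublist rather than to the sublists themselves. A minimal sketch follows, assuming a hypothetical helper named `remove_stop_words`. The small fallback stop-word list is an assumption added only so the snippet still runs when NLTK or its stopwords corpus is unavailable.

```python
def remove_stop_words(token_lists, stop_words):
    """Return a new nested list with stop words dropped from every sublist."""
    stop_set = set(stop_words)  # set membership tests are O(1)
    return [[tok for tok in tokens if tok not in stop_set]
            for tokens in token_lists]


try:
    from nltk.corpus import stopwords
    english_stop_words = stopwords.words('english')
except (ImportError, LookupError):
    # Assumption: tiny stand-in list used only when NLTK or its
    # stopwords corpus is not installed.
    english_stop_words = ['how', 'to', 'my', 'me', 'the', 'what']

mylist = [['how', 'to', 'unlock', 'my', 'bajaj', 'finance', 'emi', 'card'],
          ['how', 'to', 'unlock', 'my', 'card'],
          ['how', 'to', 'unlock', 'my', 'card', 'tell', 'me', 'the', 'what', 'next'],
          ['how', 'to', 'unlock', 'my', 'emi', 'card']]

filtered_data = remove_stop_words(mylist, english_stop_words)
print(filtered_data)
```

The comparison is case-sensitive and the NLTK list is lowercase, so tokens such as "He" would need `tok.lower()` in the membership test to be removed.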