I have a list like this:
mylist = [['how', 'to', 'unlock', 'my', 'bajaj', 'finance', 'emi', 'card'], ['how', 'to', 'unlock', 'my', 'card'], ['how', 'to', 'unlock', 'my', 'card', 'tell', 'me', 'the', 'what', 'next'], ['how', 'to', 'unlock', 'my', 'emi', 'card']]
I want to remove the stopwords from it. My code looks like this:
stopword = stopwords.words('english')
filtered_data = []
for w in range(0, len(lemmetizeXlist)):
    if lemmetizeXlist[w] not in stopword:
        filtered_data.append(lemmetizeXlist[w])
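One likely issue with the loop above: `mylist` is a list of token *lists*, so the `not in stopword` check runs against whole sublists rather than individual words. A minimal sketch of filtering each inner list instead (using a small hardcoded stopword set here purely for illustration; in practice you would use `stopwords.words('english')` from `nltk.corpus`):

```python
# Hypothetical small stopword set standing in for stopwords.words('english')
stop_words = {'how', 'to', 'my', 'me', 'the', 'what'}

mylist = [['how', 'to', 'unlock', 'my', 'bajaj', 'finance', 'emi', 'card'],
          ['how', 'to', 'unlock', 'my', 'card'],
          ['how', 'to', 'unlock', 'my', 'card', 'tell', 'me', 'the', 'what', 'next'],
          ['how', 'to', 'unlock', 'my', 'emi', 'card']]

# Filter the tokens inside each sublist, preserving the nested structure
filtered_data = [[w for w in sublist if w not in stop_words]
                 for sublist in mylist]
print(filtered_data)
```

This keeps the list-of-lists shape, so each query stays grouped after filtering.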
Lemmatization based on part-of-speech (POS) tags.
Code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

text = """He determined to drop his litigation with the monastry, and
relinguish his claims to the wood-cuting and
fishery rihgts at once. He was the more ready to do this becuase the
rights had become much less valuable, and he had
indeed the vaguest idea where the wood and river in question were."""

# Tokenize and remove English stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_sentence = []
for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)
print(filtered_sentence)

# Lemmatize each remaining word, chaining noun, verb, then adjective passes
wordnet_lemmatizer = WordNetLemmatizer()
lemma_word = []
for w in filtered_sentence:
    word1 = wordnet_lemmatizer.lemmatize(w, pos="n")
    word2 = wordnet_lemmatizer.lemmatize(word1, pos="v")
    word3 = wordnet_lemmatizer.lemmatize(word2, pos="a")
    lemma_word.append(word3)
print(lemma_word)
Input (the filtered sentence):
He determined drop litigation monastry, relinguish claims wood-cuting fishery rihgts. He ready becuase rights become much less valuable, indeed vaguest idea wood river question.
Output:
He determined drop litigation monastry, relinguish claim wood-cuting fishery rihgts. He ready becuase right become much le valuable, indeed vaguest idea wood river question.
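Instead of chaining noun, verb, and adjective passes over every token, you can pick one POS per token from an actual tagger. A sketch of the usual Penn Treebank to WordNet tag mapping (a hypothetical helper, not part of NLTK itself; the tag prefixes `J`, `V`, `R` are standard Penn Treebank conventions):

```python
# Map a Penn Treebank POS tag (as produced by nltk.pos_tag) to the
# single-letter pos argument that WordNetLemmatizer.lemmatize() expects.
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return 'a'  # adjective
    elif treebank_tag.startswith('V'):
        return 'v'  # verb
    elif treebank_tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # default to noun

print(get_wordnet_pos('VBD'))  # 'v'
print(get_wordnet_pos('NNS'))  # 'n'
```

With this helper, the lemmatization loop would become something like `wordnet_lemmatizer.lemmatize(w, pos=get_wordnet_pos(tag)) for w, tag in nltk.pos_tag(filtered_sentence)`, so each word is lemmatized once under its most plausible part of speech.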