TfidfVectorizer: error when a text consists entirely of stop words

Question

I am running this code:

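import spacy
from spacy.lang.en import STOP_WORDS
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: the question never shows how nlp is created; any English spaCy pipeline should work
nlp = spacy.load("en_core_web_sm")

def lemmatizer(text):
    # Tokenize with spaCy and return the lemma of each token
    return [word.lemma_ for word in nlp(text)]

# we need to generate the lemmas of the stop words
stop_words_str = " ".join(STOP_WORDS) # nlp function needs a string
stop_words_lemma = set(word.lemma_ for word in nlp(stop_words_str))

# Recent scikit-learn releases expect stop_words to be a list rather than a set
tfidf_lemma = TfidfVectorizer(max_features=100,
                              stop_words=list(stop_words_lemma.union({"pax", "west", "hyatt", "wscc", "borderlands"})),
                              tokenizer=lemmatizer)

tfidf_lemma.fit(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(tfidf_lemma.get_feature_names_out())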

and I get the following error:

ValueError: np.nan is an invalid document, expected byte or unicode string.

I suspect this happens because some of the responses I am processing consist entirely of stop words. I am using spaCy's stop words.

from spacy.lang.en import STOP_WORDS

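I have read through some of the responses, and a few are along the lines of "something everyone has". I believe these turn into NaN once the stop words are filtered out, which causes the error. What is a good way to handle this?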

Answer 1

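This turned out to be a silly question. I had actually made a mistake during data preparation and missed the NaN values in documents, because I forgot to call .dropna() before converting the dataframe column to a list.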
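
For anyone hitting the same error, here is a minimal sketch of the fix described above. The dataframe, its "response" column, and the sample texts are made up for illustration, and scikit-learn's built-in English stop word list stands in for the spaCy lemmatized one to keep the example short. It also suggests that a document made up only of stop words is not the culprit: it should simply end up as an all-zero row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data: one missing response and one made up only of stop words.
df = pd.DataFrame({
    "response": [
        "great panels and games",
        float("nan"),          # missing value, the real cause of the ValueError
        "the and of",          # stop words only
        "long lines at the games",
    ]
})

# Drop missing values *before* converting the column to a list.
documents = df["response"].dropna().tolist()

# Assumption: the built-in "english" list replaces the custom spaCy-based list.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)

# The stop-word-only document is kept; its row just has no non-zero entries.
stop_word_only_row = matrix[documents.index("the and of")]
print(len(documents), stop_word_only_row.nnz)
```

If the rows have to stay aligned with the original dataframe, calling .fillna("") on the column instead of .dropna() is another option, since an empty string is a valid (if empty) document.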
