I'm new to data science and Python, but I'm trying to use sklearn's CountVectorizer. I want to drop the words that appear in 90% or more of my documents, and I'm using the following code:
df = movies2['overview'].values.astype(str)
cv = CountVectorizer(df, max_df = 0.9)
count_vector = cv.fit_transform(df)
I also tried replacing max_df = 0.9 with a predefined stop-word list, but both versions raise the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-114-91468fdad6f6> in <module>
1 df = movies2['overview'].values.astype(str)
2 cv = CountVectorizer(df, max_df = 0.9)
----> 3 count_vector = cv.fit_transform(df)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
1056
1057 vocabulary, X = self._count_vocab(raw_documents,
-> 1058 self.fixed_vocabulary_)
1059
1060 if self.binary:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
968 for doc in raw_documents:
969 feature_counter = {}
--> 970 for feature in analyze(doc):
971 try:
972 feature_idx = vocabulary[feature]
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
350 tokenize)
351 return lambda doc: self._word_ngrams(
--> 352 tokenize(preprocess(self.decode(doc))), stop_words)
353
354 else:
~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in decode(self, doc)
130 The string to decode
131 """
--> 132 if self.input == 'filename':
133 with open(doc, 'rb') as fh:
134 doc = fh.read()
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
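I think the problem might be that each document contains several stop words, such as "the", but I'm not sure how to fix this error. Any help would be greatly appreciated!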
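I think the problem is that you are passing the numpy array to CountVectorizer when you initialize it. The first positional parameter of the constructor is input, which according to the official documentation should be one of 'filename', 'file' or 'content'; the documents themselves (a sequence of items of type str or bytes) are passed later, to fit_transform. With the array stored in input, the check self.input == 'filename' shown at the bottom of the traceback compares every element and yields an array of booleans, and using that array in an if statement raises the ValueError. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer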
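The last frame of the traceback fails on the line if self.input == 'filename'. Here is a minimal sketch of that comparison outside of sklearn, assuming the array ended up in input; the two sample strings are made up and only stand in for movies2['overview']:

```python
import numpy as np

# Hypothetical stand-in for movies2['overview'].values.astype(str)
docs = np.array(["the hero saves the city", "the villain escapes"])

# Comparing an array with a string is done element by element,
# so the result is one boolean per document, not a single True/False.
comparison = docs == "filename"

# Using that boolean array as an `if` condition is what raises the error.
try:
    if comparison:
        pass
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# A plain list is compared as a whole object, so the same check is simply False.
list_comparison = list(docs) == "filename"
print(list_comparison)
```

This is also why a list does not trigger the exception in the same spot: the comparison gives a single False instead of an array.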
This can be solved in two ways:
The first is to initialize CountVectorizer without passing the data to it, as follows:
df = movies2['overview'].values.astype(str)
cv = CountVectorizer(max_df = 0.9)
count_vector = cv.fit_transform(df)
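As a self-contained sketch of this first approach, the snippet below uses a made-up four-document corpus in place of movies2['overview'] (the sentences are an assumption for illustration only). It passes every setting by keyword and gives the documents only to fit_transform. The same pattern applies to the stop-word attempt from the question:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus standing in for movies2['overview']; not the asker's data.
docs = [
    "the hero saves the city",
    "the villain escapes",
    "the detective finds a clue",
    "the crew robs a bank",
]

# Settings are keyword arguments; the documents are passed to fit_transform.
cv = CountVectorizer(max_df=0.9)
count_vector = cv.fit_transform(docs)

# "the" occurs in 4 of 4 documents (100% > 90%), so max_df removes it.
print("the" in cv.vocabulary_)
print(count_vector.shape)

# Same idea with the built-in English stop-word list instead of max_df.
cv_stop = CountVectorizer(stop_words="english")
count_vector_stop = cv_stop.fit_transform(docs)
print("the" in cv_stop.vocabulary_)
```

Note that the default tokenizer ignores one-character tokens such as "a", so they never reach the vocabulary regardless of max_df.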
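The second is to convert the numpy array to a list before passing it to CountVectorizer. Note that this avoids the error only because comparing a list with 'filename' evaluates to a plain False; the list is still bound to the input parameter rather than used as data, and newer scikit-learn releases, as far as I know, accept constructor settings only as keyword arguments, so the first approach is the safer one: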
df = list(movies2['overview'].values.astype(str))
cv = CountVectorizer(df,max_df = 0.9)
count_vector = cv.fit_transform(df)
Hope this solves your problem.