CountVectorizer in sklearn: stop words raise an error

I'm new to data science and Python, but I'm trying to use sklearn's CountVectorizer. I want to drop words that appear in 90% or more of my documents, and I'm using the following code:

df = movies2['overview'].values.astype(str)
cv = CountVectorizer(df, max_df = 0.9)
count_vector = cv.fit_transform(df)

I also tried replacing max_df = 0.9 with a predefined stop word list, but both versions throw the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-114-91468fdad6f6> in <module>
      1 df = movies2['overview'].values.astype(str)
      2 cv = CountVectorizer(df, max_df = 0.9)
----> 3 count_vector = cv.fit_transform(df)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1056 
   1057         vocabulary, X = self._count_vocab(raw_documents,
-> 1058                                           self.fixed_vocabulary_)
   1059 
   1060         if self.binary:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    968         for doc in raw_documents:
    969             feature_counter = {}
--> 970             for feature in analyze(doc):
    971                 try:
    972                     feature_idx = vocabulary[feature]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    350                                                tokenize)
    351             return lambda doc: self._word_ngrams(
--> 352                 tokenize(preprocess(self.decode(doc))), stop_words)
    353 
    354         else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in decode(self, doc)
    130             The string to decode
    131         """
--> 132         if self.input == 'filename':
    133             with open(doc, 'rb') as fh:
    134                 doc = fh.read()

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

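I think the problem may be that each document can contain several stop words, such as "the", but I'm not sure how to fix this error. Any help would be much appreciated!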

scikit-learn stop-words countvectorizer
1 Answer

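I think the problem is that you are passing the numpy array to CountVectorizer when you initialize it. The first constructor parameter is input, which according to the official documentation should be one of the strings 'filename', 'file', or 'content'; the documents themselves (a sequence of str or bytes items) are passed to fit_transform instead. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer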
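To see where the ValueError comes from, note the line `if self.input == 'filename':` in the traceback. With the array stored as input, that comparison is elementwise and yields a boolean array, which Python cannot reduce to a single True or False. A minimal sketch of the same behaviour using numpy alone (the two sample strings are made up for illustration):

```python
import numpy as np

# Hypothetical stand-in for movies2['overview'].values.astype(str)
docs = np.array(["a movie about space", "a movie about love"])

# Comparing a string array to a str is elementwise: one boolean per document.
comparison = docs == "filename"

message = ""
try:
    # An `if` needs a single truth value, so a multi-element array raises.
    if comparison:
        pass
except ValueError as exc:
    message = str(exc)

print(message)
```

This is the same check that fails inside decode() once the array has been passed as the first constructor argument.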

This can be solved in two ways:

The first is to initialize CountVectorizer without passing the data, as follows:

df = movies2['overview'].values.astype(str)
cv = CountVectorizer(max_df = 0.9)
count_vector = cv.fit_transform(df)
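As a rough, self-contained check of this approach, the sketch below uses three made-up overview strings in place of movies2['overview'] (the data and the variable names overviews and cv_stop are assumptions for illustration). The word "the" occurs in every document, so max_df = 0.9 should drop it; the same keyword-only pattern applies to the predefined stop word list mentioned in the question.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data standing in for movies2['overview'].values.astype(str)
overviews = np.array([
    "the hero saves the city",
    "the villain escapes the prison",
    "the detective solves the mystery",
])

# Options go in as keywords; the documents only go to fit_transform.
cv = CountVectorizer(max_df=0.9)
count_vector = cv.fit_transform(overviews)

# "the" appears in 3 of 3 documents (100% > 90%), so it is excluded.
print(sorted(cv.vocabulary_))

# Same pattern with scikit-learn's built-in English stop word list.
cv_stop = CountVectorizer(stop_words="english")
count_stop = cv_stop.fit_transform(overviews)
print(sorted(cv_stop.vocabulary_))
```

With only a handful of documents, max_df thresholds are coarse, so results on the real dataset may differ from this toy example.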

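The second is to convert the numpy array to a list before passing it to CountVectorizer as input. Note that this only avoids the error because comparing a list with a string yields a plain False; the list is still stored in the input parameter, and recent scikit-learn releases accept constructor arguments by keyword only, so the first approach is the safer one: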

df = list(movies2['overview'].values.astype(str))
cv = CountVectorizer(df,max_df = 0.9)
count_vector = cv.fit_transform(df)

Hope this solves your problem.
