Using TF-IDF for the first time on a pandas Series where every entry is a list

Question · votes: 0 · answers: 1

The data looks like this:

data_clean2.head(3)

text    target
0   [deed, reason, earthquak, may, allah, forgiv, u]    1
1   [forest, fire, near, la, rong, sask, canada]    1
2   [resid, ask, shelter, place, notifi, offic, evacu, shelter, place, order, expect]   1

I obtained these tokens by stemming and lemmatizing the sentences and then tokenizing them. (Hopefully that's right.)

Now I want to use:

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data_clean2['text'])

It gives me the following error:


AttributeError                            Traceback (most recent call last)
<ipython-input-140-6f68d1115c5f> in <module>
      1 vectorizer = TfidfVectorizer()
----> 2 vectors = vectorizer.fit_transform(data_clean2['text'])

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1650         """
   1651         self._check_params()
-> 1652         X = super().fit_transform(raw_documents)
   1653         self._tfidf.fit(X)
   1654         # X is already a transformed view of raw_documents so

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in fit_transform(self, raw_documents, y)
   1056 
   1057         vocabulary, X = self._count_vocab(raw_documents,
-> 1058                                           self.fixed_vocabulary_)
   1059 
   1060         if self.binary:

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in _count_vocab(self, raw_documents, fixed_vocab)
    968         for doc in raw_documents:
    969             feature_counter = {}
--> 970             for feature in analyze(doc):
    971                 try:
    972                     feature_idx = vocabulary[feature]

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(doc)
    350                                                tokenize)
    351             return lambda doc: self._word_ngrams(
--> 352                 tokenize(preprocess(self.decode(doc))), stop_words)
    353 
    354         else:

~\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py in <lambda>(x)
    254 
    255         if self.lowercase:
--> 256             return lambda x: strip_accents(x.lower())
    257         else:
    258             return strip_accents

AttributeError: 'list' object has no attribute 'lower'

I understand that I can't use it on lists directly, so what is my play here? Should I try to turn the lists back into strings again?


scikit-learn nltk tf-idf
1 Answer

Yes — first convert the lists back to strings, using the following approach.
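A minimal sketch of that idea: join each token list back into a whitespace-separated string, then fit TfidfVectorizer on the resulting Series of strings. The small DataFrame below is a stand-in for data_clean2, built from the rows shown in the question; the column name text_joined is made up for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in for data_clean2, using the first two rows from the question.
data_clean2 = pd.DataFrame({
    "text": [
        ["deed", "reason", "earthquak", "may", "allah", "forgiv", "u"],
        ["forest", "fire", "near", "la", "rong", "sask", "canada"],
    ],
    "target": [1, 1],
})

# TfidfVectorizer expects raw strings, not lists of tokens,
# so join each list back into a single sentence-like string.
data_clean2["text_joined"] = data_clean2["text"].apply(" ".join)

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data_clean2["text_joined"])

print(vectors.shape)  # (n_documents, n_unique_terms)
```

Note that re-joining sends the text through the vectorizer's default preprocessing again, and the default token_pattern drops single-character tokens such as "u". If you want to keep the tokens exactly as they are, an alternative is to pass the pre-tokenized lists through unchanged with TfidfVectorizer(analyzer=lambda tokens: tokens).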
