ValueError: empty vocabulary; perhaps the documents only contain stop words

Question  Votes: 0  Answers: 2

I am using the scikit-learn library for the first time, and I get this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words
File "C:\Users\A605563\Desktop\velibProjetPreso\TraitementTwitterDico.py", line 33, in <module>
X_train_counts = count_vect.fit_transform(FileTweets)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only contain stop words

But I don't understand why this happens.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk


TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets =  TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
    f2.write(line.replace('\n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()

My data consists of raw tweets:

11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"

Does anyone know what is going on here?

python twitter scikit-learn
2 Answers
0
votes

Here is a simpler solution:

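from sklearn.feature_extraction.text import CountVectorizer

# Iterating over an open file yields one line at a time,
# so each line of the file becomes one document.
with open('bad_words_train.txt', 'r') as x:
    count_vect = CountVectorizer()  # default input='content'; a bare `file` is not a valid value
    X_train = count_vect.fit_transform(x)
print(X_train)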
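
Some background on why the question's code fails, as far as I can tell: `FileTweets` is one big string, while `fit_transform` expects an iterable of documents. In the old scikit-learn release shown in the traceback, the string appears to be iterated character by character, so every "document" is a single character. The default `token_pattern` (`(?u)\b\w\w+\b`) only keeps tokens of two or more word characters, so nothing survives and the vocabulary is empty. Newer releases, I believe, reject a bare string with a more explicit error. The sketch below mimics that tokenization with plain `re` instead of calling scikit-learn itself; `tokens_per_document` is a hypothetical helper, and the sample text is shortened from the tweets in the question.

```python
import re

# Default token_pattern documented for CountVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_per_document(raw_documents):
    """Tokenize each item of an iterable, treating every item as one document."""
    return [TOKEN_PATTERN.findall(doc.lower()) for doc in raw_documents]


# Assumption: two shortened tweets from the question, joined by newlines.
text = "I'm at Paris Marriott Hotel\nHer pistol go @ Raspoutine Paris\n"

# A bare string is iterated character by character: each "document" is one character.
as_string = tokens_per_document(text)
# A list of lines gives one document per tweet.
as_lines = tokens_per_document(text.splitlines())

vocab_from_string = {tok for doc in as_string for tok in doc}
vocab_from_lines = {tok for doc in as_lines for tok in doc}

print(vocab_from_string)         # set()
print(sorted(vocab_from_lines))  # ['at', 'go', 'her', 'hotel', 'marriott', 'paris', 'pistol', 'raspoutine']
```

So the practical fix is to hand the vectorizer a list of lines (or the open file, which iterates line by line) rather than the result of `read()`.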

0
votes

I found the solution:

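from io import StringIO

from sklearn.feature_extraction.text import CountVectorizer

# Read the CSV line by line; each line (timestamp plus tweet text) is one document.
# The question decodes the file as latin1, so the same encoding is used here.
with open('tweets2015-08-13.csv', 'r', encoding='latin1') as TweetsFile:
    yourResult = [line.strip() for line in TweetsFile]

# With input="file", CountVectorizer calls .read() on every item it receives,
# so each line is wrapped in a file-like StringIO object.
count_vect = CountVectorizer(input="file")
docs_new = [StringIO(x) for x in yourResult]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names_out()  # older releases use get_feature_names()
print(X_train_counts.shape)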
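
One possible refinement, offered as a sketch rather than part of the fix above: the tweet text in the question's data is often quoted and contains commas, so splitting on `,` by hand would cut tweets apart. The standard `csv` module handles the quoting. `read_tweet_texts` is a hypothetical helper, and the two sample rows are copied from the question; a real script would pass the open CSV file instead of the in-memory sample.

```python
import csv
import io

# Assumption: two rows copied from the question's data, used in place of the real file.
SAMPLE = '''2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
'''


def read_tweet_texts(handle):
    """Return the text part (everything after the timestamp) of each CSV row."""
    documents = []
    for row in csv.reader(handle):
        if len(row) >= 2:  # skip rows that lack a timestamp,text pair
            # Rejoin in case an unquoted tweet contained commas of its own.
            documents.append(",".join(row[1:]))
    return documents


documents = read_tweet_texts(io.StringIO(SAMPLE))
print(len(documents))  # 2
```

The resulting `documents` list can then be passed to a `CountVectorizer` left at its default `input='content'`, with no need for the `StringIO` wrappers.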