How to create a bag of words in Python

Question

This is my DataFrame test after I cleaned and tokenized it.

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
test['tokenize'] = test['tweet'].apply(tt.tokenize)
print(test)

Output

0  congratulations dear friend ... [congratulations, dear, friend]
1  happy anniversary be happy  ... [happy, anniversary, be, happy]
2  make some sandwich          ...          [make, some, sandwich]

I want to create a bag of words for my data. The following gives me the error: 'list' object has no attribute 'lower'

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()

BOW = vectorizer.fit_transform(test['tokenize'])
print(BOW.toarray())
print(vectorizer.get_feature_names())

The second attempt gives: AttributeError: 'list' object has no attribute 'split'

from collections import Counter
test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" ")))
print(test['BOW'])

Could you help me get either one of these approaches, or both, to work? Thanks!

Answer 1

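As your output example shows, test['tokenize'] contains a list in each cell. These lists are already the result of splitting the strings into tokens, so there is nothing left to split, and a list has no split method. To make the line test['BOW'] = test['tokenize'].apply(lambda x: Counter(x.split(" "))) work, try changing it to test['BOW'] = test['tokenize'].apply(lambda x: Counter(x))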
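To illustrate why dropping split works, here is a minimal sketch that uses plain Python lists in place of the test['tokenize'] column. The variable names tokenized and bags are only for illustration and do not come from the question.

```python
from collections import Counter

# Token lists as they appear in the test['tokenize'] column of the question.
tokenized = [
    ["congratulations", "dear", "friend"],
    ["happy", "anniversary", "be", "happy"],
    ["make", "some", "sandwich"],
]

# Counter accepts any iterable of hashable items, so each token list
# can be passed in directly; no split() call is needed.
bags = [Counter(tokens) for tokens in tokenized]

for bag in bags:
    print(dict(bag))
```

With a DataFrame, the same idea should reduce to test['tokenize'].apply(Counter), since Counter itself is the function applied to each list.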

Answer 2

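vectorizer.fit_transform takes an iterable of str, unicode, or file objects as its argument. You passed an iterable of lists (lists of token strings), which is why it fails when it tries to call lower on a list. You can simply pass the raw strings, test['tweet'], because CountVectorizer tokenizes the text for you (using its own default tokenizer rather than TweetTokenizer).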

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
BOW = vectorizer.fit_transform(test['tweet'])
print(BOW.toarray())
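# get_feature_names() was removed in scikit-learn 1.2; on 1.0 or later use get_feature_names_out()
print(vectorizer.get_feature_names_out())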

This should give you the expected output.
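To make it clearer what the resulting matrix represents, the following is a rough plain-Python sketch that builds the same kind of document-term count matrix from the already tokenized lists. It is not how CountVectorizer is implemented, and it skips CountVectorizer's default preprocessing (lowercasing, its own token pattern). The names tokenized, vocabulary, column, and matrix are illustrative assumptions.

```python
# Token lists as they appear in the test['tokenize'] column of the question.
tokenized = [
    ["congratulations", "dear", "friend"],
    ["happy", "anniversary", "be", "happy"],
    ["make", "some", "sandwich"],
]

# One column per distinct token, in alphabetical order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})
column = {token: index for index, token in enumerate(vocabulary)}

# One row per document; each cell counts how often that token occurs.
matrix = []
for tokens in tokenized:
    row = [0] * len(vocabulary)
    for token in tokens:
        row[column[token]] += 1
    matrix.append(row)

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one tweet and each column to one vocabulary word, so the 2 in the second row marks the two occurrences of "happy".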
