I have a very large corpus (roughly 400k unique sentences). I just want the TF-IDF score of each word. I tried to compute it by scanning every word and counting its frequency, but it takes too long.
I used:
X= tfidfVectorizer(corpus)
from sklearn, but it directly returns a vector representation of each sentence. Is there any way to get the TF-IDF score of every word in the corpus?
Use sklearn.feature_extraction.text.TfidfVectorizer
(taken from the docs):
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
... 'This is the first document.',
... 'This document is the second document.',
... 'And this is the third one.',
... 'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)
Now, if we print
X.toarray()
:
[[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]
[0. 0.6876236 0. 0.28108867 0. 0.53864762
0.28108867 0. 0.28108867]
[0.51184851 0. 0. 0.26710379 0.51184851 0.
0.26710379 0.51184851 0.26710379]
[0. 0.46979139 0.58028582 0.38408524 0. 0.
0.38408524 0. 0.38408524]]
Each row of this 2-D array represents a document, and each element in a row is the TF-IDF score of the corresponding word. To see which word each element represents, look at the
.get_feature_names()
method, which returns the list of feature words. For example, the row for the first document is:
[0., 0.46979139, 0.58028582, 0.38408524, 0., 0., 0.38408524, 0., 0.38408524]
In this example,
.get_feature_names()
returns:
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
So you can map the scores to words like this:
dict(zip(vectorizer.get_feature_names(), X.toarray()[0]))
{'and': 0.0, 'document': 0.46979139, 'first': 0.58028582, 'is': 0.38408524, 'one': 0.0, 'second': 0.0, 'the': 0.38408524, 'third': 0.0, 'this': 0.38408524}
As commenters have pointed out, the answer above is incorrect. The method below takes, for each token, the sum of its column over the sparse TF-IDF array.
from sklearn.feature_extraction.text import TfidfVectorizer

# initialise vectoriser
tfidf = TfidfVectorizer()
# apply to corpus of documents
X = tfidf.fit_transform(docs)
# map feature names to sum of vector array
tfidf_dict = dict(zip(tfidf.get_feature_names_out(), X.toarray().sum(axis=0)))
# sort in descending order
tfidf_dict = dict(sorted(tfidf_dict.items(), key=lambda x: x[1], reverse=True))
You can then optionally display it as a pandas DataFrame...
import pandas as pd

# initialise dataframe
tfidf_df = pd.DataFrame.from_dict(tfidf_dict, orient='index', columns=['tfidf'])
# name the index
tfidf_df.index = tfidf_df.index.rename('token')
# display first 5 rows
tfidf_df.head()