I want to compute tf-idf from the documents below. I'm using Python and pandas.
import pandas as pd
df = pd.DataFrame({'docId': [1,2,3],
'sent': ['This is the first sentence','This is the second sentence', 'This is the third sentence']})
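First, I think I need to get the word_count for each row. So I wrote a simple function:
def word_count(sent):
    # Map each whitespace-separated word to the number of times it occurs
    word2cnt = dict()
    for word in sent.split():
        if word in word2cnt:
            word2cnt[word] += 1
        else:
            word2cnt[word] = 1
    return word2cnt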
Then I applied it to each row.
df['word_count'] = df['sent'].apply(word_count)
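But now I'm lost. I know that if I used Graphlab there would be an easy way to compute tf-idf, but I want to stick with an open source option. Both Sklearn and gensim look overwhelming. What's the simplest solution to get tf-idf?
The Scikit-learn implementation is really simple: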
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['sent'])
There are plenty of parameters you can specify; see the TfidfVectorizer documentation.
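As an illustration of those parameters, here is a minimal sketch that varies two of them, ngram_range and min_df, on the three sentences from the question. The variable names are my own, and the settings are only examples, not recommendations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = ['This is the first sentence',
             'This is the second sentence',
             'This is the third sentence']

# ngram_range=(1, 2) adds two-word phrases on top of single words
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2))
x_bigrams = bigram_vectorizer.fit_transform(sentences)
print(x_bigrams.shape)

# min_df=2 keeps only terms that appear in at least two documents
common_vectorizer = TfidfVectorizer(min_df=2)
x_common = common_vectorizer.fit_transform(sentences)
print(sorted(common_vectorizer.vocabulary_))
```

With min_df=2 the words that are unique to one sentence ("first", "second", "third") are dropped, which is exactly the information tf-idf would otherwise weight most heavily, so a filter like this is only useful on larger corpora.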
The output of fit_transform will be a sparse matrix; if you want to visualize it, you can convert it to a dense array:
x.toarray()
In [44]: x.toarray()
Out[44]:
array([[ 0.64612892, 0.38161415, 0. , 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0.64612892, 0.38161415, 0.38161415,
0. , 0.38161415],
[ 0. , 0.38161415, 0. , 0.38161415, 0.38161415,
0.64612892, 0.38161415]])
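To see where numbers such as 0.64612892 and 0.38161415 come from, the following sketch recomputes the matrix by hand. It assumes TfidfVectorizer's default settings (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 row normalization); the variable names are my own.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = ['This is the first sentence',
             'This is the second sentence',
             'This is the third sentence']

# Raw term counts, one row per document and one column per term
counts = CountVectorizer().fit_transform(sentences).toarray().astype(float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed idf (assumed default)

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

reference = TfidfVectorizer().fit_transform(sentences).toarray()
print(np.allclose(manual, reference))
```

Words that occur in every sentence get an idf of 1, while a word that occurs in only one sentence gets ln(4/2) + 1, about 1.69. After normalizing each row to unit length, those become roughly 0.38 and 0.65.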
A simple solution is to use texthero:
import texthero as hero
df['tfidf'] = hero.tfidf(df['sent'])
In [5]: df.head()
Out[5]:
docId sent tfidf
0 1 This is the first sentence [0.3816141458138271, 0.6461289150464732, 0.381...
1 2 This is the second sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
2 3 This is the third sentence [0.3816141458138271, 0.0, 0.3816141458138271, ...
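I found a slightly different method using CountVectorizer from sklearn.
-- count vectorizer: Ultraviolet Analysis word frequency
-- preprocessing/cleaning text: Usman Malik scraping tweets preprocessing
I won't cover preprocessing in this answer. Basically, you import CountVectorizer and fit your data to a CountVectorizer object, which gives you access to the .vocabulary_ attribute (for example via .vocabulary_.items()). That is the vocabulary of your dataset: the unique words present, each mapped to its column index in the count matrix, subject to any limiting parameters you pass to CountVectorizer, such as max_features. The word frequencies themselves come from the matrix returned by transform.
Then you use TfidfTransformer in a similar way to generate tf-idf weights for the terms.
I'm coding in a Jupyter notebook file, using pandas and the PyCharm IDE.
Here's a code snippet: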
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import pandas as pd
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
countVec = CountVectorizer(max_features= 5000, stop_words='english', min_df=.01, max_df=.90)
#%%
#use CountVectorizer.fit(raw_documents) to learn the vocabulary dictionary of all tokens in the raw documents
#raw documents in this case will be tweetsFrameWords["Text"] (processed text)
countVec.fit(tweetsFrameWords["Text"])
#useful debug, get an idea of the item list you generated
list(countVec.vocabulary_.items())
#%%
#convert to bag of words
#transform returns a SciPy sparse matrix (documents x terms) of token counts
countVec_count = countVec.transform(tweetsFrameWords["Text"])
#%%
#make array from number of occurrences
occ = np.asarray(countVec_count.sum(axis=0)).ravel().tolist()
#make a new data frame with columns term and occurrences, meaning word and number of occurrences
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
bowListFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'occurrences': occ})
print(bowListFrame)
#sort by number of word occurrences, most->least. if you leave off the ascending flag, it defaults to ascending
bowListFrame.sort_values(by='occurrences', ascending=False).head(60)
#%%
#now, convert to a more useful ranking system, tf-idf weights
#TfidfTransformer: scale raw word counts to a weighted ranking using the
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
tweetTransformer = TfidfTransformer()
#initial fit representation using transformer object
tweetWeights = tweetTransformer.fit_transform(countVec_count)
#follow similar process to making new data frame with word occurrences, but with term weights
tweetWeightsFin = np.asarray(tweetWeights.mean(axis=0)).ravel().tolist()
#now that we've done tf-idf, make a dataframe with weights and names
tweetWeightFrame = pd.DataFrame({'term': countVec.get_feature_names_out(), 'weight': tweetWeightsFin})
print(tweetWeightFrame)
tweetWeightFrame.sort_values(by='weight', ascending=False).head(20)
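I think Christian Perone's example is the most straightforward example of how to use CountVectorizer and TF-IDF. The code below is straight from his webpage (it is written for Python 2 and an older scikit-learn). I also benefited from the answers here.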
https://blog.christianperone.com/2011/10/machine-learning-text-feature-extraction-tf-idf-part-ii/
from sklearn.feature_extraction.text import CountVectorizer
train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
"We can see the shining sun, the bright sun.")
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print "Vocabulary:", count_vectorizer.vocabulary
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
freq_term_matrix = count_vectorizer.transform(test_set)
print freq_term_matrix.todense()
#[[0 1 1 1]
#[0 2 1 0]]
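Now that we have the frequency term matrix (called freq_term_matrix), we can instantiate the TfidfTransformer, which is going to be responsible for calculating the tf-idf weights for our term frequency matrix: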
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print "IDF:", tfidf.idf_
# IDF: [ 0.69314718 -0.40546511 -0.40546511 0. ]
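Note that I've specified the norm as L2. This is optional (the default is actually the L2-norm), but I've added the parameter to make it explicit that it's going to use the L2-norm. Also note that you can see the calculated idf weights by accessing the internal attribute called idf_. Now that the fit() method has calculated the idf for the matrix, let's transform the freq_term_matrix to the tf-idf weight matrix:
--- I had to make the following changes for Python 3, and note that .vocabulary_ includes the word "the". I haven't found or built a solution for that... yet ---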
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.", "We can see the shining sun, the bright sun."]
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(train_set)
print ("Vocabulary:")
print(count_vectorizer.vocabulary_)
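# vocabulary_ maps each term to its column index; get_feature_names_out() returns the terms in column order
Vocab = list(count_vectorizer.get_feature_names_out())
print(Vocab)
# The blog's output was {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}; with the default settings, words such as 'the' and 'is' are kept here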
freq_term_matrix = count_vectorizer.transform(test_set)
print (freq_term_matrix.todense())
count_array = freq_term_matrix.toarray()
df = pd.DataFrame(data=count_array, columns=Vocab)
print(df)
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
print ("IDF:")
print(tfidf.idf_)
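The snippet above stops after fitting, and the vocabulary still contains "the". Here is a sketch of one possible way to finish it: passing stop_words="english" to CountVectorizer uses scikit-learn's built-in English stop word list, and calling transform() on the fitted TfidfTransformer produces the tf-idf weight matrix. The name tf_idf_matrix is my own, and the built-in stop word list is a convenience rather than a carefully tuned choice, so treat it as an assumption to review for your data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_set = ["The sky is blue.", "The sun is bright."]
test_set = ["The sun in the sky is bright.",
            "We can see the shining sun, the bright sun."]

# stop_words="english" drops common words such as "the" and "is"
count_vectorizer = CountVectorizer(stop_words="english")
count_vectorizer.fit(train_set)
print(count_vectorizer.vocabulary_)

# Count only the learned vocabulary terms in the test documents
freq_term_matrix = count_vectorizer.transform(test_set)
print(freq_term_matrix.toarray())

# Fit the idf weights, then convert the counts to L2-normalized tf-idf weights
tfidf = TfidfTransformer(norm="l2")
tfidf.fit(freq_term_matrix)
tf_idf_matrix = tfidf.transform(freq_term_matrix)
print(tf_idf_matrix.toarray())
```

Words in the test set that were never seen during fit (such as "shining") are ignored by transform, and a vocabulary term that never appears in the test set ("blue") ends up as an all-zero column.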