在这个例子中,scikit-learn的tf-idf是否正确?最常见的单词得分很高

问题描述 投票:0回答:1
from sklearn.feature_extraction.text import TfidfVectorizer

documents=["The car is driven on the road","The truck is                  
            driven on the highway","the lorry is"]
fidf_transformer=TfidfVectorizer(smooth_idf=True,use_idf=True)
tfidf=tfidf_transformer.fit_transform(documents)
print(tfidf_transformer.vocabulary_)
print(tfidf.toarray())

{'the': 7, 'car': 0, 'on': 5, 'driven': 1, 'is': 3, 'road':         6, 'lorry': 4, 'truck': 8, 'highway': 2}
[[0.45171082 0.34353772 0. 0.26678769 0.  0.34353772 0.45171082 0.53357537 0.        ]
 [0.         0.34353772 0.45171082 0.26678769 0.         0.34353772 0.         0.53357537 0.45171082]
 [0.         0.         0.         0.45329466 0.76749457 0. 0.         0.45329466 0.        ]]

“the”这个词应该在三个文件中得分较低

tf-idf tfidfvectorizer
1个回答
0
投票

tfidf =术语频率(tf)*逆doc频率(idf)

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names())
print (X.toarray())
print ("---")
t = TfidfTransformer(use_idf=True, norm=None, smooth_idf=False)
a = t.fit_transform(X)
print (a.toarray())
print ("---")
print (t.idf_)

输出:

enter image description here

if(the)是低但是tf(,doc 1)= 2是高的,这推动了其他的话。

从上面的示例代码:

idf(没有Norm,非平滑的idf)是== == 1

然而,tf(,doc1)= 2和tf(是,doc1)= 1,这会增加tfidf(the,doc1)的tfidf的值。

同样idf(car)= 2.09861229但是tf(car,doc1)= 1,=> tfidf(car,doc1)= 2.09861229,这非常接近tfidf(the,doc1)。 idf的平滑进一步缩小了差距。

在大型语料库中,差异变得更加突出。

尝试通过禁用平滑和无规范化来运行代码,以查看对小型语料库的影响。

tfidf_transformer = TfidfVectorizer(smooth_idf = False,use_idf = True,norm = None)

© www.soinside.com 2019 - 2024. All rights reserved.