TF-IDF values with TfidfVectorizer

Question

I am learning NLP and would like to understand the TF-IDF model using the sklearn library and its TfidfVectorizer class.

I have pasted my sample code below.


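import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Learn the vocabulary and idf weights, then build the tf-idf matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# In scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead.
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())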

Feature names:

vectorizer.get_feature_names()

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

The tf-idf values are:

array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])

I am interested in computing the tf-idf value of the term "document" in the corpus above. For the first document, the result is 0.43877674.
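For reference, here is a minimal sketch of how that single value can be read out of the fitted vectorizer instead of being picked from the printed array by eye. It assumes the same corpus as above; the variable names `col`, `value` and `idf_document` are only illustrative. It also prints the learned idf weight for "document", which is the number a hand calculation of the idf component should be compared against before any row normalization.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in X.
col = vectorizer.vocabulary_['document']

# tf-idf value of the term 'document' in the first document (row 0).
value = X[0, col]

# idf_ holds the learned idf weight of each term, before row normalization.
idf_document = vectorizer.idf_[col]

print(col, value, idf_document)
```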

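I tried the following formula with both the base-10 and the natural (base-e) logarithm, since smooth_idf=True by default, following the documentation at https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting:

Using the TfidfTransformer's default settings, TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False), the term frequency, the number of times a term occurs in a given document, is multiplied with the idf component, which is computed as

$\text{idf}(t) = \log\frac{1 + n}{1 + \text{df}(t)} + 1$

where $n$ is the total number of documents in the document set, and $\text{df}(t)$ is the number of documents in the document set that contain term $t$.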

According to the output of the program above, the value should be 0.43877674.
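To make the comparison concrete, here is a rough pure-Python sketch of the calculation under the default settings quoted above (norm='l2', smooth_idf=True, sublinear_tf=False). It is not the library's implementation. It assumes the natural logarithm, a simple tokenizer that lowercases the text and keeps words of two or more characters, and a final step that divides each document's raw tf * idf weights by their Euclidean (L2) norm. The helper names `smooth_idf` and `tfidf_row` are made up for this illustration.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumed tokenization: lowercase, keep words of two or more characters.
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n = len(docs)

# df(t): number of documents that contain term t at least once.
df = Counter(term for doc in docs for term in set(doc))


def smooth_idf(term):
    """Smoothed idf with the natural log: ln((1 + n) / (1 + df(t))) + 1."""
    return math.log((1 + n) / (1 + df[term])) + 1


def tfidf_row(doc):
    """Raw tf * idf weights for one document, divided by their L2 norm."""
    counts = Counter(doc)
    raw = {term: count * smooth_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}


first = tfidf_row(docs[0])
second = tfidf_row(docs[1])

print('idf(document):', smooth_idf('document'))
print('tf-idf(document, doc 1):', first['document'])
print('tf-idf(second, doc 2):', second['second'])
```

With these assumptions the unnormalized weight for "document" is roughly 1.22, and it only comes close to the 0.43877674 in the array after the division by the row norm, so a hand calculation that stops at tf * idf would not match the output regardless of the logarithm base.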
