Why do my tf-idf values look inconsistent?

Question

I have a series of tweets that have been converted to tokens. They include the following:

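geraldkutney happen realise happen convenient rename capture but emergency happen government come
michaelemann burn happen chicken shit change get stupid argue good deed go unpunished
rickcaughell thomas_6278 coderedearth jrockstrom jordanbpeterson fact exxon predict today temperature high accuracy back 1970s 80 professor model accurate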

Note that the first two tweets have 13 tokens in total and the third tweet has 3.

Using the following code, I created the TF-IDF values:

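# assumed import behind the sk_text alias
from sklearn.feature_extraction import text as sk_text

# learn the vocabulary and idf weights from the cleaned tweets
vectoriser = sk_text.TfidfVectorizer()
vectoriser.fit(twit_api['text_clean'])

# sparse tf-idf matrix: one row per tweet, one column per token
twit_vec = vectoriser.transform(twit_api['text_clean'])
twit_vec.columns = vectoriser.get_feature_names_out()

tokens_enc = twit_vec.toarray()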

When I look at the tf-idf value of "happen" in each tweet, I get the values 0.41124561276932653, 0.18906439908376366 and 0.1523571031416618.

This is the code:

print(tokens_enc[row_nos[0], vectoriser.vocabulary_['happen']])

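These values seem inconsistent to me. I would expect the first value to be twice the second, because the tf is exactly twice as large, but that does not appear to be the case.

Am I misunderstanding something?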

Answer 1

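TfidfVectorizer accepts many parameters, and any you do not specify take their default values. In your case, the three parameters that affect the result are: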

# ---------------------------
norm : {'l1', 'l2'} or None, default='l2'
Each output row will have unit norm, either:

'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.

'l1': Sum of absolute values of vector elements is 1. See normalize.

None: No normalization.

# ---------------------------
use_idf : bool, default=True
Enable inverse-document-frequency reweighting. If False, idf(t) = 1.

# ---------------------------
smooth_idf : bool, default=True
Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.
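To see how these defaults interact, here is a minimal hand-rolled sketch in plain Python. It is not scikit-learn's implementation; it only assumes the smoothed idf formula ln((1 + n) / (1 + df)) + 1 and l2 row normalisation. The toy corpus and the helper name `tfidf_rows` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs, norm="l2"):
    """Hand-rolled tf-idf with smoothed idf and optional l2 row normalisation."""
    n_docs = len(docs)
    tokenised = [doc.split() for doc in docs]
    # document frequency: number of documents each term occurs in
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        row = {t: count * idf[t] for t, count in tf.items()}
        if norm == "l2" and row:
            # divide by the Euclidean length so the row has unit norm
            length = math.sqrt(sum(v * v for v in row.values()))
            row = {t: v / length for t, v in row.items()}
        rows.append(row)
    return rows

# hypothetical toy corpus: "happen" occurs twice in the first document, once in the second
docs = ["happen storm happen flood", "happen rain", "sun wind"]
raw = tfidf_rows(docs, norm=None)
scaled = tfidf_rows(docs, norm="l2")

raw_ratio = raw[0]["happen"] / raw[1]["happen"]
scaled_ratio = scaled[0]["happen"] / scaled[1]["happen"]
print(raw_ratio)     # ratio of unnormalised weights
print(scaled_ratio)  # ratio after l2 normalisation
```

Without normalisation the two weights differ by exactly the tf ratio, because both share the same idf. After l2 normalisation each row is divided by its own length, which depends on every other token in that tweet, so the ratio between rows is no longer the tf ratio.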

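Smoothing and normalization happen behind the scenes, but it looks like you can turn all of them off. Try something like the following: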

vectoriser = sk_text.TfidfVectorizer(norm=None, use_idf=False, smooth_idf=False)

I cannot test this without your dataset, but the function's documentation for these parameters should help with any further issues.
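As a rough check that does not need the original data, the sketch below compares a few settings on a small made-up corpus. The corpus and the helper `happen_weights` are assumptions for illustration only; the vectoriser parameters are the ones quoted above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus: "happen" occurs twice in the first document, once in the second
docs = ["happen storm happen flood", "happen rain", "sun wind"]

def happen_weights(**kwargs):
    """Return the weight of 'happen' in the first two documents."""
    vec = TfidfVectorizer(**kwargs)
    matrix = vec.fit_transform(docs).toarray()
    col = vec.vocabulary_["happen"]
    return matrix[0, col], matrix[1, col]

default_first, default_second = happen_weights()           # l2 norm, smoothed idf
plain_first, plain_second = happen_weights(norm=None)      # idf kept, no normalisation
counts = happen_weights(norm=None, use_idf=False)          # raw term counts

print(default_first / default_second)
print(plain_first / plain_second)
print(counts)
```

On this toy data, setting norm=None alone should already restore the factor of two, which suggests the row normalisation is what breaks the proportionality; disabling idf as well simply leaves the raw counts.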
