Python (TextBlob) TF-IDF calculation

Question

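I have looked into several ways of computing TF-IDF scores for the words in a document with Python, and I chose to use TextBlob.

However, I am getting negative values. As I understand it, that cannot be right: a non-negative quantity (tf) divided by (the log of) a positive quantity (df) should not produce a negative value.

I have looked at the following question posted here: TFIDF calculating confusion, but it did not help.

How I compute the scores: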

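    import math

    def tf(word, blob):
        # Term frequency: share of the words in this blob that equal `word`
        return blob.words.count(word) / len(blob.words)

    def n_containing(word, bloblist):
        # Number of blobs that contain `word`
        return sum(1 for blob in bloblist if word in blob)

    def idf(word, bloblist):
        # The +1 in the denominator guards against division by zero
        return math.log(len(bloblist) / (1 + n_containing(word, bloblist)))

    def tfidf(word, blob, bloblist):
        return tf(word, blob) * idf(word, bloblist)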

Then I simply print out the words with their scores. For this input:

    "hello, this is a test. a test is always good."


   Top words in document
   Word: good, TF-IDF: -0.06931
   Word: this, TF-IDF: -0.06931
   Word: always, TF-IDF: -0.06931
   Word: hello, TF-IDF: -0.06931
   Word: a, TF-IDF: -0.13863
   Word: is, TF-IDF: -0.13863
   Word: test, TF-IDF: -0.13863

From what I know and what I have read, could it be that the IDF is computed incorrectly?

Any help would be much appreciated. Thanks.

Answer 1

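Without an input/output example it is hard to pinpoint the cause. One candidate is the idf() function, which returns a negative value when the word occurs in every blob. That happens because of the +1 in the denominator, which I assume is there to avoid division by zero: when the word is in all N blobs, the argument of the log becomes N / (N + 1), which is less than 1. A possible fix is to check for zero explicitly: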

def idf(word, bloblist):
    x = n_containing(word, bloblist)
    return math.log(len(bloblist) / (x if x else 1))

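Note: with this version, a word that occurs in exactly one blob and a word that occurs in no blob at all return the same value. There are other solutions that may suit your needs; just remember not to take the log of a fraction smaller than 1, since that is negative.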
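To make the arithmetic concrete, here is a minimal sketch that compares the two idf formulas directly on document counts rather than on TextBlob objects. The function names `idf_original` and `idf_checked` and their count-based signatures are assumptions made for illustration, and Python 3 division is assumed. With a single document, every word in it occurs in 1 of 1 documents, so the original formula evaluates log(1 / 2).

```python
import math

def idf_original(n_docs, n_containing):
    # Formula from the question: the +1 is applied to the denominator only
    return math.log(n_docs / (1 + n_containing))

def idf_checked(n_docs, n_containing):
    # Formula from this answer: replace the denominator only when it is zero
    return math.log(n_docs / (n_containing if n_containing else 1))

# One document, and the word occurs in it
print(idf_original(1, 1))  # log(0.5), negative
print(idf_checked(1, 1))   # log(1.0), zero

# The caveat from the note: zero and one containing documents are tied
print(idf_checked(3, 0) == idf_checked(3, 1))
```

Multiplying log(0.5) by a term frequency of 1/10 gives roughly -0.0693, which matches the score printed for "good" in the question.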

Answer 2

IDF scores should be non-negative. The problem is in the implementation of the idf function.

Try the following:

from __future__ import division, print_function
from textblob import TextBlob
import math

def tf(word, blob):
    return blob.words.count(word) / len(blob.words)

def n_containing(word, bloblist):
    # Add 1 to the document count so the denominator is never zero
    return 1 + sum(1 for blob in bloblist if word in blob)

def idf(word, bloblist):
    # Add 1 to the numerator as well, so the ratio never drops below 1
    return math.log(float(1 + len(bloblist)) / float(n_containing(word, bloblist)))

def tfidf(word, blob, bloblist):
    return tf(word, blob) * idf(word, bloblist)

text = 'tf–idf, short for term frequency–inverse document frequency'
text2 = 'is a numerical statistic that is intended to reflect how important'
text3 = 'a word is to a document in a collection or corpus'

blob = TextBlob(text)
blob2 = TextBlob(text2)
blob3 = TextBlob(text3)
bloblist = [blob, blob2, blob3]
tf_score = tf('short', blob)
idf_score = idf('short', bloblist)
tfidf_score = tfidf('short', blob, bloblist)
print(tf_score, idf_score, tfidf_score)
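As a quick sanity check on the smoothing idea, the sketch below applies the same formula, log((1 + N) / (1 + n)), to plain lists of tokens instead of TextBlob objects. The helper name `idf_smoothed`, the whitespace tokenization, and the two sample documents are assumptions for illustration, and Python 3 is assumed. Because n can never exceed N, the ratio is at least 1 and the log is never negative.

```python
import math

def idf_smoothed(word, docs):
    # n = number of token lists containing the word (exact token match)
    n = sum(1 for doc in docs if word in doc)
    # Both numerator and denominator get +1, so the ratio is always >= 1
    return math.log((1 + len(docs)) / (1 + n))

# Hypothetical two-document corpus, tokenized by whitespace
docs = [
    "hello this is a test".split(),
    "a test is always good".split(),
]

scores = {word: idf_smoothed(word, docs) for doc in docs for word in doc}
for word, score in sorted(scores.items()):
    print(word, round(score, 5))
```

Words that appear in every document, such as "a" here, score exactly zero, while rarer words score higher.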
