Calculating PMI for bigrams, and a discrepancy in the result

Question

Suppose I have the following text:

text = "this is a foo bar bar black sheep  foo bar bar black sheep foo bar bar black sheep shep bar bar black sentence"

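I can compute the PMI of its bigrams with NLTK as follows:

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.tokenize import word_tokenize

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
for i in finder.score_ngrams(bigram_measures.pmi):
    print(i)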

This gives:

(('is', 'a'), 4.523561956057013)
(('this', 'is'), 4.523561956057013)
(('a', 'foo'), 2.938599455335857)
(('sheep', 'shep'), 2.938599455335857)
(('black', 'sentence'), 2.523561956057013)
(('black', 'sheep'), 2.523561956057013)
(('sheep', 'foo'), 2.353636954614701)
(('bar', 'black'), 1.523561956057013)
(('foo', 'bar'), 1.523561956057013)
(('shep', 'bar'), 1.523561956057013)
(('bar', 'bar'), 0.5235619560570131)

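Now, to check my own understanding, I want to compute the PMI of ('black', 'sheep') by hand. The PMI formula is:

PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )

The text contains 4 instances of 'black' and 3 instances of 'sheep', 'black' and 'sheep' occur together 3 times, and the text is 23 tokens long. Plugging these into the formula: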

np.log((3/23)/((4/23)*(3/23)))

This gives 1.749199854809259 instead of 2.523561956057013. Why is there a discrepancy? What am I missing here?

Answer 1

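The PMI score that NLTK reports uses a base-2 logarithm, not base e.

According to the NumPy documentation, numpy.log is the natural logarithm (base e), which is not what you want here.

The following expression gives 2.523561956057013:

import math
math.log((3/23)/((4/23)*(3/23)), 2)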
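To double-check the counts as well as the logarithm base, here is a minimal sketch that recomputes the PMI of ('black', 'sheep') using only the standard library. It makes two assumptions that are not in the original post: tokens are obtained with a plain whitespace split (which produces the same 23 tokens as word_tokenize for this punctuation-free text), and the helper name pmi is purely illustrative. Both the unigram and the bigram counts are divided by the token count, mirroring the hand calculation in the question.

```python
import math
from collections import Counter

text = ("this is a foo bar bar black sheep  foo bar bar black sheep "
        "foo bar bar black sheep shep bar bar black sentence")

# Assumption: whitespace splitting stands in for word_tokenize here.
tokens = text.split()
n = len(tokens)

unigrams = Counter(tokens)                  # counts of single words
bigrams = Counter(zip(tokens, tokens[1:]))  # counts of adjacent pairs


def pmi(w1, w2, log=math.log2):
    """PMI of the adjacent pair (w1, w2), using the given log function."""
    p_xy = bigrams[(w1, w2)] / n
    p_x = unigrams[w1] / n
    p_y = unigrams[w2] / n
    return log(p_xy / (p_x * p_y))


pmi_base2 = pmi("black", "sheep")             # base-2 logarithm
pmi_base_e = pmi("black", "sheep", math.log)  # natural logarithm

print(n, unigrams["black"], unigrams["sheep"], bigrams[("black", "sheep")])
print(pmi_base2)                 # about 2.5236, matching the NLTK output
print(pmi_base_e)                # about 1.7492, the value from the question
print(pmi_base_e / math.log(2))  # change of base recovers the base-2 value
```

The last line shows that the two numbers differ only by the constant factor ln 2, so the counts in the question were right and only the logarithm base was off.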
