How do I define the log-count ratio for a multi-class text dataset (fastai)?

Question

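I am trying to follow Rachel Thomas's walkthrough of sentiment classification with Naive Bayes. In the video, she uses a binary dataset (positive and negative movie reviews). To apply Naive Bayes, this is what she does: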

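Definition: the log-count ratio r for each word (feature) f:

r = log( (ratio of feature f in positive documents) / (ratio of feature f in negative documents) )

where the ratio of feature f in positive documents is the number of times positive documents contain the feature, divided by the number of positive documents.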

p1 = np.squeeze(np.asarray(x[y.items==positive].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==negative].sum(0)))

pr1 = (p1+1) / ((y.items==positive).sum() + 1)
pr0 = (p0+1) / ((y.items==negative).sum() + 1)

r = np.log(pr1/pr0)

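-> Applying the log-count ratio to a dataset with 2 labels is straightforward!

Problem: my dataset is not binary! Suppose I have 5 labels: label_1, ..., label_5

How do I get the log-count ratio r for a multi-class dataset?

My approach: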

p4 = np.squeeze(np.asarray(x[y.items==label_5].sum(0)))
p3 = np.squeeze(np.asarray(x[y.items==label_4].sum(0)))
p2 = np.squeeze(np.asarray(x[y.items==label_3].sum(0)))
p1 = np.squeeze(np.asarray(x[y.items==label_2].sum(0)))
p0 = np.squeeze(np.asarray(x[y.items==label_1].sum(0)))

log-count-ratio:
pr1 = (p1+1) / ((y.items==label_2).sum() + 1)
pr1_not = (p1+1) / ((y.items!=label_2).sum() + 1)
r_1 = np.log(pr1/pr1_not)

log-count-ratio:
pr2 = (p2+1) / ((y.items==label_3).sum() + 1)
pr2_not = (p2+1) / ((y.items!=label_3).sum() + 1)
r_2 = np.log(pr2/pr2_not)
...

Is this correct? Does it mean I end up with multiple ratios?

Answer 1

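According to https://marvinlsj.github.io/2018/11/23/NBSVM%20for%20sentiment%20and%20topic%20classification/, the log-count ratio is derived from the ratio of posterior probabilities, which is useful for comparing two classes to see which one is more likely. I guess you are trying a one-vs-one approach to the multi-class problem. That ends up with 5x4/2 = 10 pairwise ratios for classification. If you only want to classify, we usually compute the posterior probability for each class and pick the best one. So in your case, you just pick the best of sum(log(p1)), sum(log(p2)), ..., sum(log(p5)).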
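
To make the "score every class and pick the best" idea concrete, here is a minimal sketch. It is not the code from the video or from the linked post. It assumes x is a dense document-term count matrix (call .toarray() first if yours is sparse) and y is an array with one label per document. The function names are hypothetical, and the smoothing is standard multinomial Naive Bayes with Laplace smoothing, which normalizes by the total word count per class rather than by the document count used in the question's snippet.

```python
import numpy as np

def fit_multiclass_nb(x, y, classes, alpha=1.0):
    """Estimate log P(class) and log P(feature | class) for every class.

    Hypothetical helper: x is a dense (n_docs, n_features) count matrix,
    y holds one label per document, alpha is the Laplace smoothing constant.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    n_features = x.shape[1]
    log_prior = np.empty(len(classes))
    log_like = np.empty((len(classes), n_features))
    for i, c in enumerate(classes):
        mask = y == c
        counts = x[mask].sum(axis=0)  # feature counts over documents of class c
        # Smoothed multinomial estimate of P(feature | class c)
        log_like[i] = np.log((counts + alpha) / (counts.sum() + alpha * n_features))
        # Fraction of documents in class c (assumes every class has at least one document)
        log_prior[i] = np.log(mask.mean())
    return log_prior, log_like

def predict_multiclass_nb(x, log_prior, log_like, classes):
    """Return the class with the highest log posterior score for each document."""
    scores = np.asarray(x, dtype=float) @ log_like.T + log_prior
    return np.asarray(classes)[scores.argmax(axis=1)]
```

With this layout, the log-count ratio between any two classes i and j is log_like[i] - log_like[j], so the binary r from the question is the special case with two rows. For pure classification you never need the pairwise ratios, only the argmax over the per-class scores.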
