Understanding LDA implementation using gensim

Question

I am trying to understand how the gensim package in Python implements Latent Dirichlet Allocation. I am doing the following:

Define the dataset

documents = ["Apple is releasing a new product", 
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]

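After removing stop words, I create the dictionary and the corpus:

import gensim
from gensim import corpora
# stoplist is the collection of stop words to drop; its definition is not shown here
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]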

Then I define the LDA model.

lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, update_every=1, chunksize=10000, passes=1)

Then I print the topics:

>>> lda.print_topics(5)
['0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product', '0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new', '0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is', '0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new', '0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft']
2013-12-03 13:26:21,878 : INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product
2013-12-03 13:26:21,880 : INFO : topic #1: 0.077*nokia + 0.077*announces + 0.077*acquisition + 0.077*apple + 0.077*many + 0.077*amazon + 0.077*sells + 0.077*microsoft + 0.077*things + 0.077*new
2013-12-03 13:26:21,880 : INFO : topic #2: 0.181*microsoft + 0.181*announces + 0.181*acquisition + 0.181*nokia + 0.031*many + 0.031*sells + 0.031*amazon + 0.031*apple + 0.031*new + 0.031*is
2013-12-03 13:26:21,881 : INFO : topic #3: 0.077*acquisition + 0.077*announces + 0.077*sells + 0.077*amazon + 0.077*many + 0.077*nokia + 0.077*microsoft + 0.077*releasing + 0.077*apple + 0.077*new
2013-12-03 13:26:21,881 : INFO : topic #4: 0.158*releasing + 0.158*is + 0.158*product + 0.158*new + 0.157*apple + 0.027*sells + 0.027*nokia + 0.027*announces + 0.027*acquisition + 0.027*microsoft
>>>

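I am not able to make much sense of this result. Is it giving the probability of occurrence of each word? Also, what is the meaning of topic #1, topic #2, and so on? I was expecting something more or less like the most important keywords.

I have already checked the gensim tutorial, but it did not really help much.

Thanks.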

Answer 1

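The answer you are looking for is in the gensim tutorial. lda.print_topics(k) prints the most contributing words for k randomly selected topics. One can take this as (part of) the distribution of words over each of the given topics: each number is the probability of that word appearing in the topic named on the left.

Usually, one would run LDA on a large corpus. Running LDA on a ridiculously small sample like this one will not give good results.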

Answer 2

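I think this tutorial will help you understand everything very clearly - https://www.youtube.com/watch?v=DDq3OVp9dNA

I too had a hard time understanding it at first. I will briefly outline a few points.

In Latent Dirichlet Allocation,

The order of words in a document does not matter - the Bag of Words model.
A document is a distribution over topics.
Each topic, in turn, is a distribution over the words in the vocabulary.
LDA is a probabilistic generative model. It infers the hidden variables (the topics) through their posterior distribution given the observed words.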

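Imagine the process of creating a document to be something like this -

Choose a distribution over topics.
Draw a topic from that distribution, then choose a word from that topic. Repeat this for each word in the document.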
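The generative story above can be sketched in plain Python. The two topics, their word probabilities and the alpha values are invented for illustration; this shows the data-generating process the model assumes, not how gensim fits the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate `length` words following the LDA generative story."""
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(length):
        # Step 2: draw a topic for this word position...
        z = rng.choices(topic_ids, weights=theta, k=1)[0]
        # ...then draw a word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return theta, words

# Two made-up topics, each a probability distribution over a tiny vocabulary.
topics = [
    {"amazon": 0.4, "sells": 0.3, "things": 0.3},
    {"microsoft": 0.4, "nokia": 0.3, "acquisition": 0.3},
]

rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)   # the document's topic proportions
print(words)   # the generated bag of words
```

Fitting LDA runs this in the opposite direction: only the words are observed, and the topic proportions and the topics themselves have to be inferred.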

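LDA works backwards along this line - given a bag of words representing a document, what topics could it be representing?

So, in your case, the first topic (0)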

INFO : topic #0: 0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + 0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + 0.031*acquisition + 0.031*product

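is more about things, amazon and many, since they have a higher proportion, and not so much about microsoft or apple, which have a significantly lower value.

I would suggest reading this blog post for a much better understanding (Edwin Chen is a genius!) - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/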
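To make that reading concrete, the printed topic string can be split into (word, weight) pairs. This is a small sketch using only the topic #0 line quoted above; the parse_topic helper and the 0.1 cutoff are illustrative choices, not part of gensim.

```python
def parse_topic(line):
    """Split '0.181*things + 0.181*amazon + ...' into (word, weight) pairs."""
    pairs = []
    for term in line.split(" + "):
        weight, word = term.split("*", 1)
        pairs.append((word.strip().strip('"'), float(weight)))
    return pairs

topic0 = ("0.181*things + 0.181*amazon + 0.181*many + 0.181*sells + "
          "0.031*nokia + 0.031*microsoft + 0.031*apple + 0.031*announces + "
          "0.031*acquisition + 0.031*product")

pairs = parse_topic(topic0)

# Words whose weight clearly stands out from the rest (arbitrary 0.1 cutoff).
keywords = [word for word, weight in pairs if weight > 0.1]
print(keywords)                             # ['things', 'amazon', 'many', 'sells']

# Total probability mass covered by the ten printed words.
print(round(sum(w for _, w in pairs), 3))   # 0.91
```

The ten weights add up to about 0.91 rather than 1, presumably because only the top ten words are printed and the rest of the vocabulary shares the remaining mass.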

Answer 3

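Since the answers above were posted, some very nice visualization tools have become available for building an intuition of LDA with gensim.

Take a look at the pyLDAvis package. Here is a great notebook overview. And here is a very helpful video description aimed at the end user (a 9-minute tutorial).

Hope this helps!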

Answer 4

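To help with understanding how the gensim LDA implementation is used, I recently wrote blog posts that implement topic modelling from scratch in Python on 70,000 articles from a Simple Wikipedia dump.

They explain in detail how gensim's LDA can be used for topic modelling. They cover the usage of
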

ElementTree library for extraction of article text from XML dumped file.
Regex filters to clean the articles.
NLTK stop words removal & Lemmatization
LDA from gensim library

Hope it helps in understanding the LDA implementation of the gensim package.

Part 1

Topic Modelling (Part 1): Creating Article Corpus from Simple Wikipedia dump

Part 2

Topic Modelling (Part 2): Discovering Topics from Articles with Latent Dirichlet Allocation

Word clouds (10 words each) of a few of the topics I got as a result.

Answer 5

It returns the percent likelihood of that word being associated with that topic. By default, LDA shows you the top ten words :)
