如何从gensim打印LDA主题模型?蟒蛇

问题描述 投票:15回答:9

使用gensim我能够从LSA中的一组文档中提取主题但是如何访问从LDA模型生成的主题?

当打印lda.print_topics(10)时,代码给出了以下错误,因为print_topics()返回NoneType

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

编码:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# I can print out the topics for LSA
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]

for l,t in izip(corpus_lsi,corpus):
  print l,"#",t
print
for top in lsi.print_topics(2):
  print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l,t in izip(corpus_lda,corpus):
  print l,"#",t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
  print top
python nlp lda topic-modeling gensim
9个回答
15
投票

经过一番乱搞后,似乎print_topics(numoftopics)ldamodel有一些bug。所以我的解决方法是使用print_topic(topicid)

>>> print lda.print_topics()
None
>>> for i in range(0, lda.num_topics-1):
>>>  print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...

0
投票

使用Gensim清理它自己的主题格式。

Topic Word  P  
0     w1    0.004437  
0     w2    0.003553  
0     w3    0.002953  
0     w4    0.002866  
0     w5    0.008813  
1     w6    0.003393  
1     w7    0.003289  
1     w8    0.003197 
... 

输出:

****This code works fine but I want to know the topic name instead of Topic: 0 and Topic:1, How do i know which topic this word comes in**?** 



for index, topic in lda_model.show_topics(formatted=False, num_words= 30):
        print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

8
投票

我认为show_topics的语法随着时间的推移而发生了变化:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

对于num_topics主题数,返回num_words最重要的单词(默认情况下每个主题10个单词)。

主题作为列表返回 - 如果格式为True,则为字符串列表;如果为False,则为(概率,单词)2元组的列表。

如果log为True,也将此结果输出到log。

与LSA不同,LDA中的主题之间没有自然排序。因此,所有主题的返回的num_topics <= self.num_topics子集是任意的,并且可以在两次LDA训练运行之间改变。


6
投票

你在使用任何记录吗? qazxsw poi按照qazxsw poi中的说明打印到日志文件中。

正如@ mac389所说,print_topics是打印到屏幕的方式。


3
投票

您可以使用:

docs

2
投票

以下是打印主题的示例代码:

lda.show_topics()

2
投票

我认为将主题视为单词列表总是更有帮助。以下代码段有助于实现该目标。我假设你已经有一个名为for i in lda_model.show_topics(): print i[0], i[1] 的lda模型。

def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file where I have lists of lists containing bag of words
    texts = pickle.load(open(filename, "rb"))

    # generate dictionary
    dict = corpora.Dictionary(texts)

    # remove words with low freq.  3 is an arbitrary number I have picked here
    low_occerance_ids = [tokenid for tokenid, docfreq in dict.dfs.iteritems() if docfreq == 3]
    dict.filter_tokens(low_occerance_ids)
    dict.compactify()
    corpus = [dict.doc2bow(t) for t in texts]
    # Generate LDA Model
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)
    i = 0
    # We print the topics
    for topic in lda.show_topics(num_topics=numTopics, formatted=False, topn=20):
        i = i + 1
        print "Topic #" + str(i) + ":",
        for p, id in topic:
            print dict[int(id)],

        print ""

在上面的代码中,我决定显示属于每个主题的前30个单词。为简单起见,我已经展示了我得到的第一个主题。

lda_model

我不太喜欢上述主题的外观,所以我通常会将代码修改为如下所示:

for index, topic in lda_model.show_topics(formatted=False, num_words= 30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

...和输出(显示的前2个主题)看起来像。

Topic: 0 
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1 
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

0
投票

最近,在使用Python 3和Gensim 2.3.0时遇到了类似的问题。 for idx, topic in lda_model.show_topics(formatted=False, num_words= 30): print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic]))) Topic: 0 Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental Topic: 1 Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head 没有给出任何错误但也没有打印任何东西。原来print_topics()返回一个列表。所以人们可以这样做:

show_topics()

0
投票

您还可以将每个主题中的顶部单词导出到csv文件。 show_topics()控制每个主题下要导出的单词数。

topic_list = show_topics()
print(topic_list)

CSV文件具有以下格式

topn

0
投票
import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn = 5)])

pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")
© www.soinside.com 2019 - 2024. All rights reserved.