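I'm a beginner in NLP and this is my first time doing topic modeling. I was able to generate the model, but I can't produce the coherence metric.
Converting the term-document matrix from a df to the gensim format: df -> sparse matrix -> gensim corpus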
sparse_counts = scipy.sparse.csr_matrix(data_dtm)
corpus = matutils.Sparse2Corpus(sparse_counts)
corpus
df_lemmatized.head()
# Gensim also requires a dictionary of all the terms and their respective locations in the term-document matrix
tfidfv = pickle.load(open("tfidf.pkl", "rb"))
id2word = dict((v, k) for k, v in tfidfv.vocabulary_.items())
id2word
Here is my model:
lda = models.LdaModel(corpus=corpus, id2word=id2word, num_topics=15, passes=10, random_state=43)
lda.print_topics()
Finally, this is where I try to get the coherence score using CoherenceModel:
# Compute Perplexity
print('\nPerplexity: ', lda.log_perplexity(corpus))
# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda, texts=df_lemmatized.long_title, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Here is the error:
---> 57     if not dictionary.id2token:  # may not be initialized in the standard gensim.corpora.Dictionary
     58         setattr(dictionary, 'id2token', {v: k for k, v in dictionary.token2id.items()})
     59
AttributeError: 'dict' object has no attribute 'id2token'
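I don't have your data, so I can't reproduce the error. So I'll take a guess! The problem is with id2word, which should be a corpora.dictionary.Dictionary, not just a dict. So you need to do the following: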
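>>> from gensim import corpora
>>>
>>> d = corpora.Dictionary()
>>> d.id2token = id2word
>>> # gensim's coherence code also reads token2id (it appears in the traceback), so fill the reverse mapping too
>>> d.token2id = {token: idx for idx, token in id2word.items()}
>>> #...
>>> # change `id2word` to `d`
>>> coherence_model_lda = CoherenceModel(model=lda, texts=df_lemmatized.long_title, dictionary=d, coherence='c_v')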
And I think it should work now!