How do I compute the coherence score of an sklearn LDA model?

Question (votes: 1, answers: 1)

Here, best_lda_model is an LDA model built with sklearn, and we are trying to compute its coherence score.

coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\n Coherence Score :',coherence_lda)

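Output: the error below is raised because I am trying to compute the coherence score of an sklearn LDA topic model. Is there a way around it? Also, what metric does sklearn's LDA use to group these words together?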

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
   490                 matutils.argsort(topic, topn=topn, reverse=True) for topic in
--> 491                 model.get_topics()
   492             ]

AttributeError: 'LatentDirichletAllocation' object has no attribute 'get_topics'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-106-ce8558d82330> in <module>
----> 1 coherence_model_lda = CoherenceModel(model = best_lda_model,texts=data_vectorized, dictionary=dictionary,coherence='c_v')
     2 coherence_lda = coherence_model_lda.get_coherence()
     3 print('\n Coherence Score :',coherence_lda)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in __init__(self, model, topics, texts, corpus, dictionary, window_size, keyed_vectors, coherence, topn, processes)
   210         self._accumulator = None
   211         self._topics = None
--> 212         self.topics = topics
   213 
   214         self.processes = processes if processes >= 1 else max(1, mp.cpu_count() - 1)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in topics(self, topics)
   433                     self.model)
   434         elif self.model is not None:
--> 435             new_topics = self._get_topics()
   436             logger.debug("Setting topics to those of the model: %s", self.model)
   437         else:

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics(self)
   467     def _get_topics(self):
   468         """Internal helper function to return topics from a trained topic model."""
--> 469         return self._get_topics_from_model(self.model, self.topn)
   470 
   471     @staticmethod

~\AppData\Local\Continuum\anaconda3\lib\site-packages\gensim\models\coherencemodel.py in _get_topics_from_model(model, topn)
   493         except AttributeError:
   494             raise ValueError(
--> 495                 "This topic model is not currently supported. Supported topic models"
   496                 " should implement the `get_topics` method.")
   497 

ValueError: This topic model is not currently supported. Supported topic models should implement the `get_topics` method.
scikit-learn gensim lda
1 answer
0 votes

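You can use tmtoolkit to compute any of the four coherence scores provided by gensim's CoherenceModel. According to its documentation, the function tmtoolkit.topicmod.evaluation.metric_coherence_gensim "also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!".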

So, to get the 'c_v' coherence measure, for example:

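# lda_model - a fitted LatentDirichletAllocation()
# vect - the fitted CountVectorizer()
# dtm_tf - the document-term matrix (presumably the output of vect.fit_transform)
# texts - the tokenized documents (one list of tokens per document)
import numpy as np
from tmtoolkit.topicmod.evaluation import metric_coherence_gensim

metric_coherence_gensim(measure='c_v',
                        top_n=25,
                        topic_word_distrib=lda_model.components_,
                        dtm=dtm_tf,
                        # vocab must follow the column order of the dtm; the keys of
                        # vect.vocabulary_ are not guaranteed to be in that order
                        # (use get_feature_names() on scikit-learn < 1.0)
                        vocab=np.array(vect.get_feature_names_out()),
                        texts=train['cleaned_NOUN'].values)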
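
Coherence measures are computed from the top words of each topic, so the vocabulary has to line up with the columns of the topic-word matrix. Here is a minimal, dependency-free sketch of that lookup. The function name `top_words_per_topic` and the toy data are made up for illustration; with a real model, the rows would be `lda_model.components_` and the names would come from the vectorizer in column order.

```python
def top_words_per_topic(topic_word, feature_names, top_n=10):
    """Return the top_n highest-weighted words for each topic.

    topic_word    -- rows are topics, columns are vocabulary terms
    feature_names -- the terms, listed in column order
    """
    topics = []
    for row in topic_word:
        weights = list(row)
        if len(weights) != len(feature_names):
            raise ValueError("feature_names must have one entry per column")
        # Column indices sorted by descending weight.
        order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([feature_names[j] for j in order[:top_n]])
    return topics


# Toy data (hypothetical): 2 topics over a 4-term vocabulary.
feature_names = ["apple", "banana", "engine", "wheel"]
topic_word = [
    [5.0, 3.0, 0.1, 0.2],  # a "fruit"-like topic
    [0.2, 0.1, 4.0, 6.0],  # a "car"-like topic
]
topics = top_words_per_topic(topic_word, feature_names, top_n=2)
print(topics)
```

The traceback in the question shows that CoherenceModel also takes a `topics` argument. Passing word lists like these as `topics=` (with tokenized `texts` and a matching `dictionary`, and without `model=`) should avoid the `get_topics` call, though I have not checked this against every gensim version.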

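As for the second part of the question: as far as I know, perplexity (which often does not agree with human judgment) is the native evaluation method of sklearn's LDA implementation.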
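
For reference, sklearn's LatentDirichletAllocation exposes `perplexity(X)` and `score(X)`, which are based on an approximate (variational) bound. The sketch below is not sklearn's computation; it is the textbook definition, exp(-log-likelihood / number of tokens), assuming the document-topic and topic-word distributions are already known and each row sums to 1. Note that `components_` is not normalized, so it would need row-normalizing first. All names and numbers here are illustrative.

```python
import math


def perplexity(doc_term_counts, doc_topic, topic_word):
    """exp(-log-likelihood / number of tokens) under fixed distributions.

    doc_term_counts -- word counts, one row per document
    doc_topic       -- p(topic | document), one row per document
    topic_word      -- p(word | topic), one row per topic
    """
    log_lik = 0.0
    n_tokens = 0
    for counts, theta in zip(doc_term_counts, doc_topic):
        for w, n in enumerate(counts):
            if n == 0:
                continue
            # p(word w | document) = sum over topics k of theta[k] * topic_word[k][w]
            p = sum(t * topic_word[k][w] for k, t in enumerate(theta))
            log_lik += n * math.log(p)
            n_tokens += n
    return math.exp(-log_lik / n_tokens)


# Toy check (hypothetical data): if every topic is uniform over a 4-term
# vocabulary, the model is as "surprised" as a fair 4-way guess.
doc_term_counts = [[2, 0, 1, 1], [0, 3, 0, 1]]
doc_topic = [[0.5, 0.5], [0.9, 0.1]]
topic_word = [[0.25] * 4, [0.25] * 4]
value = perplexity(doc_term_counts, doc_topic, topic_word)
print(value)  # approximately 4, the vocabulary size
```

Lower perplexity means the model predicts the held-out words better, but as noted above, that does not necessarily mean the topics look more coherent to a human reader.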
