Is the topic distribution of a document in LDA space probabilistic?


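I know that building an LDA model is probabilistic, and that two models trained on the same corpus with the same parameters will not necessarily be identical. However, I would like to know whether the topic distribution inferred for a document passed to an LDA model is also probabilistic.

I have an LDA model, created as follows: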

lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=numTopics, passes=10)

and two documents, Doc1 and Doc2. I want to find the cosine similarity of the two documents in LDA space, like so:

x = cossim(lda[Doc1], lda[Doc2])

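The problem I have noticed is that when I run this over multiple iterations, the cosine similarity is not always the same (even when I use the same saved LDA model). The similarities are very close, but each one is always slightly off. In my actual code I have hundreds of documents, so I convert the topic distributions to dense vectors and do the calculation on a matrix with numpy: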

documentsList = np.array(documentsList)
calcMatrix = 1 - cdist(documentsList, documentsList, metric=self.metric)
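
The dense-vector step described above can be sketched without any dependencies. This is only an illustration under stated assumptions: `to_dense`, `cosine_similarity`, `similarity_matrix` and the sample `docs` are hypothetical names and data, not part of the original code. It assumes each topic distribution is a list of `(topic_id, probability)` pairs in which low-probability topics may be missing, so absent topics are filled with zero.

```python
import math

def to_dense(sparse_vec, num_topics):
    """Expand a list of (topic_id, probability) pairs into a fixed-length list."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(sparse_docs, num_topics):
    """Pairwise cosine similarities between all documents."""
    dense_docs = [to_dense(vec, num_topics) for vec in sparse_docs]
    return [[cosine_similarity(u, v) for v in dense_docs] for u in dense_docs]

# Hypothetical topic distributions for two documents over three topics.
docs = [
    [(0, 0.9), (2, 0.1)],
    [(1, 0.5), (2, 0.5)],
]
matrix = similarity_matrix(docs, num_topics=3)
```

Cosine similarity divides by the vector norms, so the result does not depend on whether the topic vectors sum to one.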

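Am I running into rounding errors in numpy (or some other bug in my code), or is this behavior I should expect when using LDA to find a document's topic distribution?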

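Edit: I will run a simple cosine similarity on 2 different documents using my LDA model and plot the range of the results. I will report back with what I find.

OK, here are the results of running the cosine similarity of 2 documents using the same LDA model.

Here is my code: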

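import gensim
import matplotlib.pyplot as plt
import seaborn as sns
from gensim import matutils

def testSpacesTwoDocs(doc1, doc2, dictionary):
    simList = []
    lda = gensim.models.ldamodel.LdaModel.load('LDA_Models/lda_bow_behavior_allFields_t385_p10')
    # The bag-of-words vectors do not change between runs, so build them once.
    doc1bow = dictionary.doc2bow(doc1)
    doc2bow = dictionary.doc2bow(doc2)

    for i in range(50):
        # Each lookup runs topic inference for the document again.
        vec1 = lda[doc1bow]
        vec2 = lda[doc2bow]

        S = matutils.cossim(vec1, vec2)
        simList.append(S)

    for entry in simList:
        print(entry)

    # Plot the 50 similarity values to show their spread.
    sns.set_style("darkgrid")
    plt.plot(simList, 'bs--')
    plt.show()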

Here are my results: 0.0082616863035, 0.00828413767524, 0.00826550453411, 0.00816756826185, 0.00829832701338, 0.00828970584276, 0.00828578705814, 0.00817109902484, 0.00817138247141, 0.00825297374028, 0.008269435921, 0.00826470121538, 0.00818282042634, 0.00824660449673, 0.00818087532906, 0.0081770261766, 0.00817128310123, 0.00817643202588, 0.00827404791376, 0.00832439428054, 0.00816643128216, 0.00828540881955, 0.00825746652101, 0.00816793513824, 0.00828471827526, 0.00827161219003, 0.00817773114553, 0.00826166001503, 0.00828048713541, 0.00817435544365, 0.0082956702812, 0.00826167470288, 0.00829873425476, 0.00825744872634, 0.00826802120149, 0.00829604894909, 0.0081776752236, 0.00817613482849, 0.00825839326441, 0.00817530362838, 0.0081747561999, 0.0082597447174, 0.00828958180101, 0.00827157760835, 0.00826939127657, 0.00826138381094, 0.00817755590806, 0.00827135780051, 0.00827314260067, 0.00817035250043
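
One way to judge how large this variation is would be to compare the spread of the values with their mean. The snippet below is a small sketch that does this for the first ten values listed above; the variable names are illustrative, and the same calculation could be applied to all 50 values.

```python
import statistics

# First ten similarity values copied from the results above.
sims = [
    0.0082616863035, 0.00828413767524, 0.00826550453411, 0.00816756826185,
    0.00829832701338, 0.00828970584276, 0.00828578705814, 0.00817109902484,
    0.00817138247141, 0.00825297374028,
]

mean = statistics.mean(sims)
stdev = statistics.stdev(sims)          # sample standard deviation
spread = max(sims) - min(sims)          # absolute range of the values
relative_spread = spread / mean         # range as a fraction of the mean

print("mean=%.6f stdev=%.2e relative spread=%.2f%%"
      % (mean, stdev, 100 * relative_spread))
```

For these ten values the range comes to less than 2% of the mean.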

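Am I correct in assuming that the LDA model infers the topic distributions of the two documents anew on each iteration, and that the cosine similarity is therefore stochastic rather than deterministic? Does this amount of variation indicate that I did not train my model long enough? Or am I not normalizing the vectors correctly? Thanks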


python numpy gensim lda
1 answer

Try setting random_state to a fixed value when training the LDA model.

lda = models.LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=numTopics, passes=10, random_state=0)

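When LDA is initialized, and again during inference, it uses a random matrix that introduces noise into the model. This noise is small and generally does not affect the final result, provided there is enough data.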
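
The effect of a random starting point can be illustrated with a toy example. This is not gensim's inference code: `toy_infer`, the halving update and the `target` mixture are all made up for illustration. The sketch only assumes that inference starts from a random vector and runs a fixed number of refinement steps, so that a small residue of the starting point remains in the result.

```python
import random

def toy_infer(target, rng, iterations=20):
    """Toy stand-in for iterative inference: start from a random
    gamma-distributed vector and move halfway toward `target` each step."""
    # Random starting point; the gamma(100, 1/100) shape is an assumption.
    estimate = [rng.gammavariate(100.0, 1.0 / 100.0) for _ in target]
    for _ in range(iterations):
        estimate = [0.5 * (e + t) for e, t in zip(estimate, target)]
    return estimate

target = [0.7, 0.2, 0.1]  # hypothetical "true" topic mixture

rng = random.Random(0)
first = toy_infer(target, rng)
second = toy_infer(target, rng)  # same generator, advanced state

# A fresh generator with the same seed reproduces the first result exactly.
reseeded = toy_infer(target, random.Random(0))

print(first)
print(second)
print(first == reseeded)
```

In this toy, two consecutive calls on the same generator give slightly different estimates, while a generator re-created with the same seed gives identical ones. This matches the behavior reported in the question: small differences between repeated inferences on one loaded model.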
