Topic modeling with word embeddings

Question

I'm currently trying to create an LDA model that uses word embeddings. Here is the code:

import numpy as np
from gensim.models.ldamodel import LdaModel
from gensim.corpora.dictionary import Dictionary
from gensim.test.utils import common_texts
from gensim.models.word2vec import Word2Vec

# Convert the text object into a list of sentences
sentences = [' '.join(doc) for doc in texts]

# Train the word2vec model to get word embeddings
model_w2v = Word2Vec(sentences, min_count=1)

# Create a dictionary and corpus for the LDA model
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Reshape the word embeddings to match the shape expected by the LDA model
word_embeddings = model_w2v.wv.vectors
num_words = len(dictionary)
num_topics = 10
eta = np.zeros((num_words, num_topics))
for i, word in enumerate(dictionary):
    if word in model_w2v.wv.key_to_index:
        idx = model_w2v.wv.key_to_index[word]
        eta[i, :] = model_w2v.wv.vectors[idx]

# Transpose eta to match the expected shape
eta = eta.T

# Train the LDA model using the reshaped word embeddings
lda_model = LdaModel(corpus=corpus,
                     id2word=dictionary,
                     num_topics=num_topics,
                     passes=10,
                     alpha='auto',
                     eta=eta)

# Print the topics generated by the LDA model
for topic in lda_model.show_topics():
    print(topic)

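The texts object is a list of lists: each sub-list is one document that I have already preprocessed. For some reason, my code produces topics that are all essentially the same: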

(1, '0.019*"benefit" + 0.017*"include" + 0.017*"plan" + 0.013*"supplier" + 0.012*"experience" + 0.011*"team" + 0.010*"business" + 0.010*"ability" + 0.010*"other" + 0.010*"management"')
(2, '0.019*"product" + 0.017*"customer" + 0.016*"work" + 0.010*"experience" + 0.010*"sale" + 0.009*"team" + 0.008*"business" + 0.007*"company" + 0.007*"management" + 0.006*"opportunity"')
(3, '0.032*"product" + 0.017*"team" + 0.016*"experience" + 0.014*"work" + 0.012*"customer" + 0.008*"business" + 0.007*"build" + 0.007*"more" + 0.006*"drive" + 0.006*"company"')
(4, '0.017*"product" + 0.013*"include" + 0.013*"work" + 0.012*"experience" + 0.012*"team" + 0.010*"benefit" + 0.008*"other" + 0.007*"customer" + 0.007*"pay" + 0.007*"program"')
(5, '0.035*"product" + 0.022*"team" + 0.017*"experience" + 0.015*"user" + 0.010*"work" + 0.010*"ability" + 0.009*"company" + 0.009*"program" + 0.009*"datum" + 0.009*"drive"')
(6, '0.017*"product" + 0.015*"experience" + 0.013*"team" + 0.012*"work" + 0.009*"project" + 0.009*"include" + 0.009*"marketing" + 0.008*"development" + 0.007*"business" + 0.007*"other"')
(7, '0.016*"product" + 0.014*"work" + 0.012*"team" + 0.010*"project" + 0.010*"experience" + 0.010*"new" + 0.009*"development" + 0.008*"support" + 0.007*"service" + 0.007*"year"')
(8, '0.015*"experience" + 0.015*"product" + 0.014*"project" + 0.011*"management" + 0.011*"team" + 0.011*"support" + 0.010*"include" + 0.009*"work" + 0.009*"system" + 0.009*"ensure"')
(9, '0.014*"work" + 0.013*"experience" + 0.011*"team" + 0.011*"datum" + 0.011*"security" + 0.009*"study" + 0.008*"product" + 0.008*"design" + 0.007*"development" + 0.007*"customer"')

I'm clearly doing something wrong here. Can you help me?

Answer 1

I'm not familiar with the approach you're attempting, which uses word vectors as an input to the LDA process via the eta parameter.

Are you following a successful recipe from somewhere? What advantage do you expect from this non-standard approach? What's wrong with the usual LDA?

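It looks like you're passing the wrong kind of corpus to Word2Vec, one where each text is a space-delimited string. It expects each text to be a list of words. When you pass strings, each one is treated as a list of single characters, so the only "words" your model is likely to learn are single characters. Those are useless as word vectors, let alone as input to a later LDA process.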
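A minimal sketch of the point above, in plain Python and without training any model. The tiny texts list is a made-up stand-in for the preprocessed documents; it only shows what a consumer that iterates over each "sentence" would see as tokens.

```python
# Stand-in for the preprocessed `texts` object: a list of token lists.
texts = [["product", "team", "experience"], ["customer", "sale", "product"]]

# What the question's code builds: one space-joined string per document.
sentences = [" ".join(doc) for doc in texts]

# Iterating over a string yields characters, so these are the "words"
# a model fed with `sentences` would see.
tokens_from_strings = {tok for sent in sentences for tok in sent}

# Iterating over a token list yields the intended words.
tokens_from_lists = {tok for doc in texts for tok in doc}

print(sorted(tokens_from_strings))  # single characters, including the space
print(sorted(tokens_from_lists))    # whole words
```

Passing texts itself, rather than the joined strings, gives Word2Vec the list-of-words input it expects.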

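(Have you checked model_w2v for sensible contents before trying to use its results? Are you running with logging at the INFO level and watching all of the output for sensible progress reports?)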
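One way to act on both questions is sketched below. Enabling INFO logging uses only the standard logging module. The summarize_vocab helper is hypothetical (it is not part of gensim), and the key list is a made-up stand-in for what list(model_w2v.wv.key_to_index) might contain after training on joined strings.

```python
import logging

# Enable INFO-level logging before training so progress reports are visible.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)

def summarize_vocab(keys):
    """Return (vocabulary size, share of single-character keys)."""
    keys = list(keys)
    single = sum(1 for k in keys if len(k) == 1)
    share = single / len(keys) if keys else 0.0
    return len(keys), share

# Hypothetical stand-in for list(model_w2v.wv.key_to_index).
suspicious_keys = [" ", "e", "t", "a", "o"]
size, share = summarize_vocab(suspicious_keys)
logging.info("vocab size=%d, single-character share=%.2f", size, share)
```

A vocabulary made up almost entirely of single characters is a strong sign that the model was trained on strings rather than token lists.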

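Using min_count=1 in Word2Vec training is essentially never a good idea. Words with only one (or a few) examples in the corpus don't get good vectors from so little data, and in aggregate they tend to make the other word vectors worse than they would be if rare words were ignored entirely (as happens with the default min_count=5).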
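To get a feel for what a frequency cutoff does before training anything, the words that would survive a given threshold can be counted directly. This is a sketch on made-up data; surviving_vocab is a hypothetical helper that only mimics the idea of a minimum-count filter, not gensim's internal vocabulary code.

```python
from collections import Counter

# Made-up stand-in for the preprocessed documents.
texts = [
    ["product", "team", "experience", "product"],
    ["product", "customer", "team", "product"],
    ["product", "team", "security"],
    ["team", "experience", "datum"],
    ["team", "product"],
]

# Total occurrences of each word across the whole corpus.
counts = Counter(tok for doc in texts for tok in doc)

def surviving_vocab(counts, min_count):
    """Words that occur at least `min_count` times in the corpus."""
    return {word for word, c in counts.items() if c >= min_count}

kept_1 = surviving_vocab(counts, 1)  # every word, including one-off words
kept_5 = surviving_vocab(counts, 5)  # only words with enough examples

print(len(kept_1), sorted(kept_5))
```

On a real corpus, comparing the two sets shows how many words would be trained on only a handful of examples.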

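I can't see how 100-dimensional word vectors (the default vector_size) would be useful in your eta. (In fact, when I tried to simulate your code with my own stand-in data, I got numpy broadcasting errors rather than a successful eta initialization, because the vectors being assigned have a different width than the rows of eta.)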
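The shape mismatch can be reproduced with numpy alone. The sketch below uses the sizes from the question (10 topics, 100-dimensional vectors) and an arbitrary vocabulary size of 12; the vector of ones is a stand-in for a real word vector.

```python
import numpy as np

num_words, num_topics, vector_size = 12, 10, 100

# Same layout as in the question: one row per word, one column per topic.
eta = np.zeros((num_words, num_topics))

# Stand-in for a single 100-dimensional word vector.
word_vector = np.ones(vector_size)

try:
    # 100 values cannot be broadcast into a row of 10 entries.
    eta[0, :] = word_vector
    assignment_failed = False
except ValueError as err:
    assignment_failed = True
    print("broadcast error:", err)
```

The assignment can only succeed if the vector width equals the number of columns in eta, which is not the case with the default vector_size and 10 topics.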

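Why would the 100 dimensions of a dense word embedding, none of which usually has a cleanly interpretable meaning on its own, be a useful hint for an LDA word-to-topic mapping that has only 10 topics and otherwise arbitrary numbers?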
