Is there a way to use a gensim LDA model together with t-SNE?


I created a gensim LDA model, and I would like to show the clustered words in a single plot, the way a t-SNE plot does:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
dictionary = Dictionary(all_texts)
corpus = [dictionary.doc2bow(text) for text in all_texts]
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, alpha="auto",per_word_topics=True,random_state=42)

I wrote the following code to plot the words:

import numpy as np
import pandas as pd
import matplotlib.colors as mcolors
from sklearn.manifold import TSNE
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook

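# Get topic weights
topic_weights = []
for row_list in lda_model[corpus]:
    # With per_word_topics=True, row_list[0] holds the (topic_id, weight) pairs.
    # Keep the topic ids so that omitted low-probability topics do not shift columns.
    topic_weights.append(dict(row_list[0]))

# Array of topic weights, one column per topic id
arr = pd.DataFrame(topic_weights).reindex(columns=range(lda_model.num_topics)).fillna(0).values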

# Keep the well separated points (optional)
arr = arr[np.amax(arr, axis=1) > 0.35]

# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)

# tSNE Dimension Reduction
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.99, init='pca')
tsne_lda = tsne_model.fit_transform(arr)

# Plot the Topic Clusters using Bokeh
output_notebook()
n_topics = 10
mycolors = np.array([color for name, color in mcolors.TABLEAU_COLORS.items()])
plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics),
              width=1500, height=800)  # plot_width/plot_height were removed in Bokeh 3
plot.scatter(x=tsne_lda[:,0], y=tsne_lda[:,1], color=mycolors[topic_num])


from bokeh.models import ColumnDataSource

# Get the dominant words for each cluster
dominant_words = []
for i in range(n_topics):
    topic_words = lda_model.show_topic(i, topn=6)  # Change the number of dominant words shown here
    words = ", ".join([word for word, _ in topic_words])
    dominant_words.append(words)

# Create a data source for the legend
legend_source = ColumnDataSource(data=dict(
    topic_num=[str(i) for i in range(n_topics)],
    color=mycolors[:n_topics],
    words=dominant_words
))

# Add the legend circles with their colors and words
plot.circle(x=0, y=0, fill_color='color', line_color=None, size=10, legend_field='words', source=legend_source)

# Show the legend
plot.legend.title = "Clusters"
plot.legend.location = "top_left"
plot.legend.label_text_font_size = "12pt"

show(plot)

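The problem is that I want to show word labels instead of points, and I want the size of each word in the plot to reflect the number of times that word occurs in the corpus.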

Right now the plot looks like this:
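One possible direction, as a rough sketch rather than a confirmed answer: embed words instead of documents, then draw each word as a text label whose font size grows with its corpus count. The sketch makes several assumptions that are not in the code above. The helper names (word_counts, font_sizes, plot_topic_words) are hypothetical, matplotlib is swapped in for Bokeh to draw the labels, each word is represented by its column of lda_model.get_topics(), and counts are mapped to point sizes on a log scale because raw counts would make the frequent words unreadably large.

```python
import math
from collections import Counter


def word_counts(all_texts):
    """Count how often each token occurs across the tokenized corpus."""
    return Counter(token for text in all_texts for token in text)


def font_sizes(counts, min_pt=8.0, max_pt=40.0):
    """Map raw counts to font sizes in points, using a log scale."""
    logs = {w: math.log1p(c) for w, c in counts.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all counts are equal
    return {w: min_pt + (v - lo) / span * (max_pt - min_pt)
            for w, v in logs.items()}


def plot_topic_words(lda_model, all_texts, topn=15):
    """Place the top words of each topic with t-SNE and draw them as labels."""
    # Imported here so the two helpers above work without these packages.
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # Top words of every topic, duplicates removed, order preserved.
    words = list(dict.fromkeys(
        word
        for t in range(lda_model.num_topics)
        for word, _ in lda_model.show_topic(t, topn=topn)
    ))
    ids = [lda_model.id2word.token2id[w] for w in words]

    # get_topics() has shape (num_topics, vocab_size);
    # one column is the topic profile of one word.
    vectors = lda_model.get_topics()[:, ids].T
    topic_of_word = vectors.argmax(axis=1)

    # t-SNE requires perplexity to be smaller than the number of points.
    tsne = TSNE(n_components=2, random_state=0, init="pca",
                perplexity=min(30, len(words) - 1))
    xy = tsne.fit_transform(vectors)

    counts = word_counts(all_texts)
    sizes = font_sizes({w: counts[w] for w in words})

    cmap = plt.get_cmap("tab10")
    fig, ax = plt.subplots(figsize=(15, 8))
    for (x, y), word, topic in zip(xy, words, topic_of_word):
        ax.text(x, y, word, fontsize=sizes[word],
                color=cmap(int(topic) % 10), ha="center", va="center")
    # Text artists do not trigger autoscaling, so set the limits by hand.
    ax.set_xlim(xy[:, 0].min() - 1, xy[:, 0].max() + 1)
    ax.set_ylim(xy[:, 1].min() - 1, xy[:, 1].max() + 1)
    ax.set_title("t-SNE of top LDA topic words")
    return fig
```

With the model from the question, this would be called as fig = plot_topic_words(lda_model, all_texts) followed by fig.savefig("words.png") or plt.show(). Each word is colored by the topic in which it has the highest probability, so a word shared by several topics appears only once.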

python matplotlib gensim topic-modeling tsne