How do I use multicore processing with sklearn LDA?

Question

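I have a topic model built with sklearn LDA. My corpus has ~75K documents, and the matrix generated from the corpus has shape X.shape = (74645, 91542).

When I pass this matrix to sklearn LDA, it takes 3 hours on my local machine and 11 hours on the server.

So my question is: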

Is there a way to use multicore processing in sklearn LDA? Or is there a way to reduce my processing time significantly?

Any help would be greatly appreciated.

Please take a look at the code:

The line that generates lda_output takes hours to run:

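import datetime

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2), vocabulary=word_list)
X = vectorizer.fit_transform(documents)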

lda_model = LatentDirichletAllocation(n_components=50,            # Number of topics
                                      learning_decay=0.7,
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )

# Point joblib's temp folder at /tmp (IPython magic); before adding this line the system was running out of memory
%env JOBLIB_TEMP_FOLDER=/tmp

start_time = datetime.datetime.now()

lda_output = lda_model.fit_transform(X)

end_time = datetime.datetime.now()

run_time_lda = end_time - start_time

#output:
#datetime.timedelta(0, 38157, 730304) ~ 11hrs

Answer 1

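You may want to reconsider your vocabulary word_list, which appears to be larger than your number of documents. Try building the vocabulary from the documents themselves, if that works for your problem.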

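Also specify min_df to remove very low-frequency words. Lemmatization or stemming could be used to reduce the vocabulary size, and it would also help LDA learn better topics.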
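
As a rough illustration of the two suggestions above, here is a minimal sketch that lets CountVectorizer learn the vocabulary from the documents and prunes rare terms with min_df. The toy `documents` list and the threshold of 2 are assumptions for the example, not values from the question; on a real corpus the threshold would need tuning.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the asker's `documents` (hypothetical data).
documents = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "a dog chased the cat",
    "quantum chromodynamics lecture",
]

# No fixed `vocabulary=` argument: terms are learned from the documents.
# min_df=2 drops every term that occurs in fewer than 2 documents.
vectorizer = CountVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(documents)

# Terms that survived pruning, in alphabetical order.
vocab = sorted(vectorizer.vocabulary_)
print(X.shape, vocab)
```

A smaller second dimension in X.shape means fewer columns for LDA to process per document, which is where the time saving is expected to come from.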

I would recommend not using bigrams/trigrams for LDA modeling, because that may lead to an uninterpretable model.
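
To see how much the n-gram setting affects vocabulary size, the following sketch compares unigrams only against unigrams plus bigrams on a toy corpus. The `documents` list is hypothetical; the asker's actual vocabulary came from a fixed word_list, so the real numbers will differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical data, not the asker's documents).
documents = [
    "topic models find themes in text",
    "text themes emerge from word counts",
    "word counts feed topic models",
]

# Unigrams only, which is what the answer recommends for LDA.
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(documents)

# Unigrams plus bigrams, matching ngram_range=(1, 2) from the question.
with_bigrams = CountVectorizer(ngram_range=(1, 2)).fit(documents)

n_unigrams = len(unigrams.vocabulary_)
n_with_bigrams = len(with_bigrams.vocabulary_)
print(n_unigrams, n_with_bigrams)
```

Each extra n-gram becomes one more column in the document-term matrix, so dropping bigrams shrinks the input to LDA as well as keeping the topics easier to read.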
