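I'm still new to Python, so this may be easier than it looks to me, but I'm stuck. I'm trying to use BERTopic and visualize the results with pyLDAvis, so that I can compare them with the results I got from LDA.
Here is my code, where "data_words" is the same object I used earlier for LDA topic modeling: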
import pyLDAvis
import numpy as np
from bertopic import BERTopic
# Train Model
bert_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = bert_model.fit_transform(data_words)
# Prepare data for PyLDAVis
top_n = 5
topic_term_dists = bert_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in data_words]  # was "docs", which is never defined in this snippet
vocab = [word for word in bert_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [bert_model.vectorizer_model.vocabulary_[word] for word in vocab]
data = {'topic_term_dists': topic_term_dists,
'doc_topic_dists': doc_topic_dists,
'doc_lengths': doc_lengths,
'vocab': vocab,
'term_frequency': term_frequency}
# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)
I keep getting the following error, and I don't understand how to fix it:
/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[9], line 4
1 from bertopic import BERTopic
3 bert_model = BERTopic()
----> 4 topics, probs = bert_model.fit_transform(data_words)
6 bert_model.get_topic_freq()
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_bertopic.py:373, in BERTopic.fit_transform(self, documents, embeddings, images, y)
325 """ Fit the models on a collection of documents, generate topics,
326 and return the probabilities and topic per document.
327
(...)
370 ```
371 """
372 if documents is not None:
--> 373 check_documents_type(documents)
374 check_embeddings_shape(embeddings, documents)
376 doc_ids = range(len(documents)) if documents is not None else range(len(images))
File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_utils.py:43, in check_documents_type(documents)
41 elif isinstance(documents, Iterable) and not isinstance(documents, str):
42 if not any([isinstance(doc, str) for doc in documents]):
---> 43 raise TypeError("Make sure that the iterable only contains strings.")
44 else:
45 raise TypeError("Make sure that the documents variable is an iterable containing strings only.")
TypeError: Make sure that the iterable only contains strings.
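The type check that fails here can be approximated without BERTopic to see which items are the problem. This is only a minimal sketch: the helper name find_non_string_docs is hypothetical and not part of BERTopic, and it is stricter than the check in the traceback (which raises only when no item at all is a string), because it reports every item that is not a string.

```python
from collections.abc import Iterable


def find_non_string_docs(documents):
    """Return (index, type name) for every item that is not a str.

    Hypothetical helper for debugging; not part of BERTopic.
    """
    if isinstance(documents, str) or not isinstance(documents, Iterable):
        raise TypeError("documents must be an iterable of strings")
    return [
        (i, type(doc).__name__)
        for i, doc in enumerate(documents)
        if not isinstance(doc, str)
    ]


# A nested list shaped like data_words: each document is a list of tokens.
nested = [["word1", "word2"], ["word3", "word4"]]
print(find_non_string_docs(nested))

# A list of plain strings has no offending items.
print(find_non_string_docs(["word1 word2", "word3 word4"]))
```

An empty result means every document is a string, which is the shape the error message asks for.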
Edit: So I assume the data I want to analyze is not in the format BERTopic expects. My dataset is structured as follows:
{
    "TFU_1881_00102": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word1",
            "word2"
        ],
        "bigramFreqs": {
            "word1 word2": 1
        },
        "tokenFreqs": {
            "word1": 1,
            "word2": 1
        }
    },
    "TFU_1881_00103": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word3",
            "word4"
        ],
        "bigramFreqs": {
            "word3 word4": 1
        },
        "tokenFreqs": {
            "word3": 1,
            "word4": 1
        }
    }
}
I then create the "data_words" object with the following code:
import json

with open("Data/5_json/output_final.json", "r") as file:
    data = json.load(file)

data_words = []
counter = 0
for key in data:
    counter += 1
    # Each entry's "tokens" value is a list of strings
    sub_list = data[key]["tokens"]
    data_words.append(sub_list)
print(counter)
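data_words is a nested list: its elements are lists, and those lists contain the strings.
bert_model.fit_transform(data_words) (and likewise .fit()) expects an iterable whose items are all strings.
You can try flattening data_words so that it contains only strings, and then call:
bert_model.fit_transform(data_words)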
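A minimal sketch of two ways to get a list of strings, assuming each inner list holds the tokens of one document as in the JSON above. Joining each token list keeps one string per document; flattening literally turns every token into its own document, which may not be what you want for topic modeling. The sample data and the variable names docs and flat_tokens are made up for illustration, and the fit_transform call is left commented out because it needs BERTopic and a real corpus.

```python
from itertools import chain

# Sample data shaped like data_words: one list of tokens per document.
data_words = [["word1", "word2"], ["word3", "word4"]]

# Option 1: join each token list into a single string (one string per document).
docs = [" ".join(tokens) for tokens in data_words]

# Option 2: flatten literally (every token becomes its own "document").
flat_tokens = list(chain.from_iterable(data_words))

print(docs)
print(flat_tokens)

# Either result is a list of plain strings and should pass the type check:
# topics, probs = bert_model.fit_transform(docs)
```

With option 1 the number of documents stays the same as in the LDA run, which may make the comparison between the two models easier.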