BERTopic: "Make sure that the iterable only contains strings"

Question (votes: 0, answers: 1)

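I'm still new to Python, so this may be easier than it looks to me, but I'm stuck. I'm trying to use BERTopic and visualize the results with pyLDAvis. I want to compare them with the results I got from LDA.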

Here is my code, where "data_words" is the same object I used earlier for LDA topic modeling:

import pyLDAvis
import numpy as np
from bertopic import BERTopic

# Train Model
bert_model = BERTopic(verbose=True, calculate_probabilities=True)
topics, probs = bert_model.fit_transform(data_words)

# Prepare data for PyLDAVis
top_n = 5

topic_term_dists = bert_model.c_tf_idf.toarray()[:top_n+1, ]
new_probs = probs[:, :top_n]
outlier = np.array(1 - new_probs.sum(axis=1)).reshape(-1, 1)
doc_topic_dists = np.hstack((new_probs, outlier))
doc_lengths = [len(doc) for doc in docs]
vocab = [word for word in bert_model.vectorizer_model.vocabulary_.keys()]
term_frequency = [bert_model.vectorizer_model.vocabulary_[word] for word in vocab]

data = {'topic_term_dists': topic_term_dists,
        'doc_topic_dists': doc_topic_dists,
        'doc_lengths': doc_lengths,
        'vocab': vocab,
        'term_frequency': term_frequency}

# Visualize using pyLDAvis
vis_data= pyLDAvis.prepare(**data, mds='mmds')
pyLDAvis.display(vis_data)

I keep getting the following error, and I don't understand how to fix it:

/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[9], line 4
      1 from bertopic import BERTopic
      3 bert_model = BERTopic()
----> 4 topics, probs = bert_model.fit_transform(data_words)
      6 bert_model.get_topic_freq()

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_bertopic.py:373, in BERTopic.fit_transform(self, documents, embeddings, images, y)
    325 """ Fit the models on a collection of documents, generate topics,
    326 and return the probabilities and topic per document.
    327 
   (...)
    370 ```
    371 """
    372 if documents is not None:
--> 373     check_documents_type(documents)
    374     check_embeddings_shape(embeddings, documents)
    376 doc_ids = range(len(documents)) if documents is not None else range(len(images))

File /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/bertopic/_utils.py:43, in check_documents_type(documents)
     41 elif isinstance(documents, Iterable) and not isinstance(documents, str):
     42     if not any([isinstance(doc, str) for doc in documents]):
---> 43         raise TypeError("Make sure that the iterable only contains strings.")
     44 else:
     45     raise TypeError("Make sure that the documents variable is an iterable containing strings only.")

TypeError: Make sure that the iterable only contains strings.

Edit: So I assume the data I want to analyze is in a different format from the one BERTopic expects. My dataset is structured as follows:

{
    "TFU_1881_00102": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word1",
            "word2"
        ],
        "bigramFreqs": {
            "word1 word2": 1
        },
        "tokenFreqs": {
            "word1": 1,
            "word2": 1
        }
    },
    "TFU_1881_00103": {
        "magazine": "edited out",
        "country": "United Kingdom",
        "year": "1881",
        "tokens": [
            "word3",
            "word4"
        ],
        "bigramFreqs": {
            "word3 word4": 1
        },
        "tokenFreqs": {
            "word3": 1,
            "word4": 1
        }
    }
}

I then create the "data_words" object with the following code:

import json

with open("Data/5_json/output_final.json", "r") as file:
    data = json.load(file)

data_words = []
counter = 0
for key in data:
    counter += 1
    sub_list = data[key]["tokens"]
    data_words.append(sub_list)
print(counter)

Tags: python, python-3.x, nlp, topic-modeling

1 answer (0 votes)

data_words is a nested list: it contains lists of strings.

bert_model.fit_transform(data_words)

.fit() expects an iterable, but one that contains only strings.
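
To see why the nested list is rejected, here is a minimal sketch that mirrors the check lines visible in the traceback (lines 41 to 45 of `_utils.py`). The function name `check_iterable_of_strings` is hypothetical, and the branch that precedes the `elif` in the real function is not shown in the traceback, so it is left out here.

```python
from collections.abc import Iterable


def check_iterable_of_strings(documents):
    """Simplified stand-in for the check shown in the traceback (hypothetical name)."""
    if isinstance(documents, Iterable) and not isinstance(documents, str):
        # Raises when none of the elements is a string.
        if not any(isinstance(doc, str) for doc in documents):
            raise TypeError("Make sure that the iterable only contains strings.")
    else:
        raise TypeError(
            "Make sure that the documents variable is an iterable containing strings only."
        )


# Same shape as data_words: a list whose elements are lists, not strings.
nested = [["word1", "word2"], ["word3", "word4"]]
try:
    check_iterable_of_strings(nested)
    nested_ok = True
except TypeError as err:
    nested_ok = False
    print(err)

# A list of plain strings passes without raising.
check_iterable_of_strings(["word1 word2", "word3 word4"])
```

Every element of the nested list is a list, so no element passes the `isinstance(doc, str)` test and the TypeError is raised.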

You could try flattening data_words so that it contains only strings, and then call:

bert_model.fit_transform(data_words)
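
As a rough sketch of what "only strings" could look like, here are two options on toy data in the same shape as data_words. The names `docs` and `flat_tokens` are assumptions made for illustration. Joining each token list with a space keeps one string per source document, which is probably closer to what the question intends. Fully flattening also yields strings only, but every token then counts as its own document.

```python
# Toy data in the same shape as data_words: one list of tokens per document.
data_words = [["word1", "word2"], ["word3", "word4"]]

# Option 1: join each token list into a single string, one string per document.
docs = [" ".join(tokens) for tokens in data_words]

# Option 2: flatten completely, so each token becomes a separate "document".
flat_tokens = [token for tokens in data_words for token in tokens]

print(docs)
print(flat_tokens)

# With bertopic installed and a realistically sized corpus, the call would be:
# topics, probs = bert_model.fit_transform(docs)
```

Either result passes the type check, since every element is a str. Which one is appropriate depends on what should count as a document.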

Related issue: https://github.com/meghutch/tracking_pasc/blob/main/BERTopic%20Preprocessing%20Test%20using%20120%2C000%20test%20tweets.ipynb
