Evaluating a BERTopic model with classification metrics

Question (votes: 0, answers: 2)

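I can't find a solution to a problem I ran into while checking the coherence score of a topic model created with BERTopic. I'm new to NLP with these methods, and especially new to Python. I'm currently working out how to get a topic coherence score from my model. However, a different classification metric might be more appropriate.

Here is my code, showing how my data is set up and how I use a pre-trained model saved locally on my Drive.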

%%capture
# load libraries (the %%capture cell magic has to be the first line of the cell)
!pip install bertopic
from bertopic import BERTopic

# mount google drive, permit access
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

# import data and define columns needed
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/BERTopic_test_data.csv")
docs = data["text"]

# load in pre saved model
my_model = BERTopic.load('/content/drive/MyDrive/my_model')

# create the topics using pre-saved model 
topic_model = my_model
topics, _ = topic_model.fit_transform(docs)

For more context, here are the components of the BERTopic model and the parameters chosen for training:

my_model

from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
import spacy
from spacy.lang.en.examples import sentences

# defining model components, as well as parameter tuning
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors = 15, n_components = 5, min_dist = 0.05, random_state = 42)
hdbscan_model = HDBSCAN(min_cluster_size = 25, min_samples = 10,
                        gen_min_span_tree = True,
                        prediction_data = True)
# `stopwords` is not defined in the code shown here; it has to be a list of stop words built beforehand
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words = stopwords)


# building the model 
my_model = BERTopic(
    umap_model = umap_model,
    hdbscan_model = hdbscan_model,
    embedding_model = embedding_model,
    vectorizer_model = vectorizer_model,
    top_n_words = 10,
    language = 'english',
    verbose = True
)

I tried a solution I found online, but it fails with the error message "AttributeError: 'BERTopic' object has no attribute 'id2word'".

# import library from gensim  
from gensim.models import CoherenceModel

# instantiate topic coherence model
cm = CoherenceModel(model=topic_model, texts=docs, coherence='c_v')

# get topic coherence score
coherence_bert = cm.get_coherence() 
print(coherence_bert)
python nlp bert-language-model topic-modeling
2 Answers

0 votes

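Typically, the performance of an NLP model is evaluated with precision (P), recall (R), and F1. A prediction can have four kinds of outcome (true or false positive, true or false negative), but for now you only care about two of them: true positives (TP) and false positives (FP), that is, whether or not your prediction matches the expected outcome.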

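  • P is TP / (TP + FP): of all the labels you predicted, the fraction you classified correctly.
  • R is TP / (number of instances of the class in the dataset).
  • F1 is the harmonic mean of these two metrics and gives you an overall view of performance. Higher is better.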

If you build two lists to compare, you can easily get these metrics with the scikit-learn library:

# Library required
from sklearn.metrics import precision_recall_fscore_support

# List holding true (expected) outcomes, one label per document (fill in before running)
y_true = []

# List holding predicted outcomes, in the same order as y_true
y_pred = []

print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
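
To see what these numbers mean, the definitions above can also be computed by hand. The following is only a minimal sketch in plain Python: the helper `precision_recall_f1` and the labels are made up for illustration, and it scores a single class rather than the macro average that scikit-learn returns above.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # tp + fn = instances of the class
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical labels, invented for this example only
y_true = ["sports", "sports", "politics", "politics", "sports", "politics"]
y_pred = ["sports", "politics", "politics", "politics", "sports", "sports"]

p, r, f1 = precision_recall_f1(y_true, y_pred, positive="sports")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Note that this kind of evaluation needs ground-truth labels for the documents, which an unsupervised topic model does not produce on its own.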

0 votes
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

def calculate_coherence_score(topic_model, docs):
    # Preprocess documents the same way BERTopic does
    # (note: _preprocess_text is a private method and may change between versions)
    cleaned_docs = topic_model._preprocess_text(docs)

    # Use the vectorizer's analyzer rather than its tokenizer, so that
    # n-grams (e.g. ngram_range=(1, 2)) are produced as tokens too
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()

    # Build the gensim dictionary and bag-of-words corpus
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc_tokens) for doc_tokens in tokens]

    # Collect the top words of each topic, skipping the outlier topic (-1)
    # and any word that does not occur in the dictionary
    topic_words = []
    for topic_id, word_scores in topic_model.get_topics().items():
        if topic_id == -1:
            continue
        words = [word for word, _ in word_scores if word in dictionary.token2id]
        if words:
            topic_words.append(words)

    coherence_model = CoherenceModel(topics=topic_words,
                                     texts=tokens,
                                     corpus=corpus,
                                     dictionary=dictionary,
                                     coherence='c_v')
    return coherence_model.get_coherence()
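
To make the idea of a coherence score concrete, here is a minimal sketch in plain Python. It is an assumption-laden simplification, not what gensim computes for `c_v` (which is more involved): it averages the normalized pointwise mutual information (NPMI) of every word pair in a topic, counting co-occurrence at the document level. The function name `npmi_coherence` and the toy documents are made up for illustration.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, tokenized_docs):
    """Mean NPMI over all word pairs of one topic (document-level co-occurrence)."""
    if len(topic_words) < 2:
        raise ValueError("need at least two topic words")
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # never co-occur: lowest NPMI by convention
        elif p12 == 1:
            scores.append(1.0)   # co-occur everywhere: avoid dividing by log(1) = 0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


# Toy tokenized documents, invented for this example only
toy_docs = [
    ["cat", "dog", "pet"],
    ["cat", "dog", "vet"],
    ["stock", "market", "bank"],
    ["stock", "bank", "loan"],
]

coherent = npmi_coherence(["cat", "dog"], toy_docs)      # words always appear together
incoherent = npmi_coherence(["cat", "stock"], toy_docs)  # words never appear together
print(coherent, incoherent)
```

Words that tend to appear in the same documents push the score toward 1, and words that never appear together push it toward -1, which is the intuition behind comparing topic models by coherence.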