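I can't find a solution to a problem I've run into while checking the coherence score of a topic model created with BERTopic. I'm new to NLP with these methods, and especially new to Python. I'm currently trying to work out how to obtain a topic coherence score from my model, although another evaluation metric might be more appropriate.
Here is my code, showing my data setup and how I use a pre-trained, locally saved model from my Drive.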
%%capture
# load libraries (the %%capture cell magic has to be the first line of the Colab cell)
!pip install bertopic
from bertopic import BERTopic
# mount google drive, permit access
from google.colab import drive
drive.mount('/content/drive', force_remount = True)
# import data and define columns needed
import pandas as pd
data = pd.read_csv("/content/drive/MyDrive/BERTopic_test_data.csv")
docs = data["text"]
# load in pre saved model
my_model = BERTopic.load('/content/drive/MyDrive/my_model')
# create the topics using pre-saved model
# (transform() reuses the fitted model; fit_transform() would retrain it from scratch)
topic_model = my_model
topics, _ = topic_model.transform(list(docs))
For more context, here are the components of the BERTopic model and the parameters chosen when it was trained:
my_model
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP
from hdbscan import HDBSCAN
import spacy
from spacy.lang.en.examples import sentences
# defining model components, as well as parameter tuning
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.05, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=25, min_samples=10,
                        gen_min_span_tree=True,
                        prediction_data=True)
# NOTE: `stopwords` is not defined in this snippet; it is assumed to be a list of stop words created earlier
vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords)
# building the model
my_model = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    top_n_words=10,
    language='english',
    verbose=True
)
I tried a solution I found online, but got the error message "AttributeError: 'BERTopic' object has no attribute 'id2word'".
# import library from gensim
from gensim.models import CoherenceModel
# instantiate topic coherence model
cm = CoherenceModel(model=topic_model, texts=docs, coherence='c_v')
# get topic coherence score
coherence_bert = cm.get_coherence()
print(coherence_bert)
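Typically, the performance of NLP models is evaluated using precision (P), recall (R), and the F1 score. There are basically four types of prediction outcome, but for now you are only interested in two of them: true positives (TP) and false positives (FP), i.e. whether or not your prediction matches the outcome you expected.
If you build two lists to compare, you can get these metrics easily with the scikit-learn library: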
# Library required
from sklearn.metrics import precision_recall_fscore_support
# List holding true (expected) outcomes, one label per document (fill in)
y_true = []
# List holding predicted outcomes, in the same order as y_true (fill in)
y_pred = []
print(precision_recall_fscore_support(y_true, y_pred, average='macro'))
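To make the snippet above concrete, here is a small worked sketch. The labels are made up purely for illustration (they are not from the asker's data), and the approach assumes you have a ground-truth topic label for each document, which is often not the case for an unsupervised topic model.

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical ground-truth labels for six documents (illustrative only)
y_true = ["sports", "sports", "politics", "politics", "tech", "tech"]
# Hypothetical predicted labels for the same documents, in the same order
y_pred = ["sports", "sports", "politics", "tech", "tech", "tech"]

# average="macro" computes each metric per class and takes the unweighted mean;
# the fourth return value (support) is None whenever an average is requested
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # prints "0.889 0.833 0.822"
```

Here one "politics" document is mislabeled as "tech", which lowers recall for "politics" and precision for "tech" while "sports" stays perfect.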
# Compute a c_v topic coherence score for a fitted BERTopic model with gensim
from gensim import corpora
from gensim.models.coherencemodel import CoherenceModel

def calculate_coherence_score(topic_model, docs):
    # Preprocess documents
    cleaned_docs = topic_model._preprocess_text(docs)
    # Extract the vectorizer and its analyzer from BERTopic; unlike the bare tokenizer,
    # the analyzer applies the same lowercasing, stop-word removal and n-grams
    # that produced the topic words
    vectorizer = topic_model.vectorizer_model
    analyzer = vectorizer.build_analyzer()
    # Tokenize the documents and build the gensim dictionary and bag-of-words corpus
    tokens = [analyzer(doc) for doc in cleaned_docs]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]
    # Create topic words: the top words of each topic (skipping the outlier topic -1),
    # restricted to words that actually occur in the dictionary
    topic_words = []
    for topic_id, word_scores in topic_model.get_topics().items():
        if topic_id == -1:
            continue
        words = [word for word, _ in word_scores if word in dictionary.token2id]
        if words:
            topic_words.append(words)
    coherence_model = CoherenceModel(topics=topic_words,
                                     texts=tokens,
                                     corpus=corpus,
                                     dictionary=dictionary,
                                     coherence='c_v')
    coherence = coherence_model.get_coherence()
    return coherence