Clustering text with the chat-intents library in Python (HDBSCAN, UMAP)

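I am using chat-intents (https://github.com/dborrelli/chat-intents) for automatic clustering. To embed the sentences I use sentence-transformers. The problem is that when I set a minimum and maximum number of clusters and then run the search, the number of clusters it finds does not stay within that range; it can be higher than the maximum.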

Code:

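from hyperopt import hp

# model (a sentence-transformers model) and utterances (a DataFrame)
# are assumed to be defined earlier
X = model.encode(utterances["FCD_COG_INPUT_TEXT"].to_list())

# Search space for the UMAP and HDBSCAN hyperparameters
hspace = {
    "n_neighbors": hp.choice('n_neighbors', range(3,16)),
    "n_components": hp.choice('n_components', range(100,115)),
    "min_cluster_size": hp.choice('min_cluster_size', range(50,65)),
    "random_state": 42
}

# Desired range for the number of clusters
label_lower = 20
label_upper = 30
max_evals = 100

# bayesian_search is the search helper from chat-intents
best_params_use, best_clusters_use, trials_use = bayesian_search(X,
                                                                 space=hspace,
                                                                 label_lower=label_lower,
                                                                 label_upper=label_upper,
                                                                 max_evals=max_evals)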

Result:

100%|██████████| 100/100 [59:49<00:00, 35.90s/trial, best loss: 0.15540102619497703] 
best:
{'min_cluster_size': 51, 'n_components': 106, 'n_neighbors': 7, 'random_state': 42}
label count: 3 

In this run there are 3 clusters, but sometimes there are more than 100.
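
One possible explanation, offered as an assumption rather than a confirmed reading of the library: if the search treats label_lower and label_upper as a soft constraint, adding a fixed penalty to the loss when the cluster count is out of range instead of rejecting the trial, then an out-of-range trial can still come out as "best" whenever every trial is penalized. The reported best loss of about 0.155 would at least be consistent with a small cost plus a penalty of 0.15. The sketch below illustrates that idea only; the function name penalized_loss, the penalty default of 0.15, and the example cost values are hypothetical and are not taken from the chat-intents source.

```python
def penalized_loss(cost, label_count, label_lower, label_upper, penalty=0.15):
    """Hypothetical soft-constraint loss (an assumption, not the library's code).

    Adds a fixed penalty when the number of clusters falls outside
    [label_lower, label_upper]; the trial itself is never rejected.
    """
    out_of_range = label_count < label_lower or label_count > label_upper
    return cost + (penalty if out_of_range else 0.0)


# A trial inside the desired range keeps its raw cost.
in_range = penalized_loss(cost=0.10, label_count=25, label_lower=20, label_upper=30)

# A trial with only 3 clusters is penalized, but still a valid candidate.
too_few = penalized_loss(cost=0.10, label_count=3, label_lower=20, label_upper=30)

print(in_range, too_few)
```

Under that assumption, the range only steers the search. If no combination in the search space yields 20 to 30 clusters (for example because min_cluster_size is large relative to the dataset), the lowest-loss trial will still be one outside the range.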

python nlp cluster-analysis hdbscan runumap