How to match K-Means cluster labels to the true labels in Python

Question  Votes: 0  Answers: 4

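I am having trouble with the labels when using the KMeans algorithm. My test sentence is assigned to the right cluster, but I do not get the right label. I used numpy to match the clusters against true_test_labels, but KMeans can move the clusters around, so the true labels do not line up with the cluster numbers. I need help solving this. Here is my code: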

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter

stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

# Cleaning the text sentences so that punctuation marks, stop words & digits are removed  
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    #print (y)
    return y

path = "coba.txt"

train_clean_sentences = []
# Read the training sentences; the context manager closes the file afterwards
with open(path, 'r') as fp:
    for line in fp:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        train_clean_sentences.append(cleaned)

#print(train_clean_sentences)
       
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)

# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)

teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"

test_clean_sentence = []

cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+","",cleaned)
test_clean_sentence.append(cleaned)
    
Test = vectorizer.transform(test_clean_sentence) 

true_test_labels = ['AI','VR','Sistem Informasi']

predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)

print ("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print ("\nIndex of Virtual Reality : ",Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print ("Index of Machine Learning : ",Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0]) 
print ("Index of Sistem Informasi : ",Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])
# np.int was removed in NumPy 1.24; take the single predicted cluster id as a plain int
print ("\n",teks_satu,":",true_test_labels[int(predicted_labels_kmeans[0])],":",predicted_labels_kmeans)

python numpy flask scikit-learn k-means
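4 Answers
1
votes

I ran into the same problem: my clustering (kmeans) returned different classes (cluster numbers) than the true classes, so the true labels and the predicted labels did not match. The solution that worked for me was this code (scroll to "Permutation maximizing the sum of the diagonal elements"). The approach works well, but I think it can give a wrong result in some cases.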
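The linked code is not reproduced in this thread, so the following is only a minimal sketch of the same idea (a permutation that maximizes the sum of the diagonal of the confusion matrix), written here with `scipy.optimize.linear_sum_assignment`. The helper name `match_clusters_to_labels` and the toy arrays are hypothetical, and the sketch assumes that both the true labels and the cluster ids are the integers 0..k-1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix


def match_clusters_to_labels(y_true, y_pred):
    """Return `mapping` such that mapping[cluster_id] is the matched true label.

    Hypothetical helper; assumes labels and cluster ids are integers 0..k-1.
    """
    # Rows are true labels, columns are cluster ids.
    cm = confusion_matrix(y_true, y_pred)
    # linear_sum_assignment minimizes total cost, so negate the counts
    # to find the one-to-one pairing with the largest total count.
    true_idx, cluster_idx = linear_sum_assignment(-cm)
    mapping = np.empty(cm.shape[1], dtype=int)
    mapping[cluster_idx] = true_idx
    return mapping


# Toy data (made up for illustration): cluster ids are a shuffled
# version of the true labels, with one misclustered point.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

mapping = match_clusters_to_labels(y_true, y_pred)
y_aligned = mapping[y_pred]  # cluster ids translated into true-label ids
print(mapping, y_aligned)
```

Because the assignment is one-to-one, no two clusters can end up with the same label, which is the main difference from taking a per-column argmax.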


1
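votes

Here is a concrete example showing how to match KMeans cluster IDs to the training data labels. The basic idea is that, if the classification was done correctly, the confusion_matrix should have large values on its diagonal. This is the confusion matrix before the cluster center ids are associated with the training labels: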

cm = 
array([[  0, 395,   0,   5,   0],
       [  0,   2,   5, 391,   2],
       [  2,   0,   0,   0, 398],
       [  0,   0, 400,   0,   0],
       [398,   0,   0,   0,   2]])

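Now we only need to reorder the confusion matrix so that its large values sit on the diagonal. This is easy to do by mapping each cluster id (column) to the true label (row) with the largest count: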

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

This gives the new confusion matrix, which should now look familiar:

cm_ = 
array([[395,   5,   0,   0,   0],
       [  2, 391,   2,   5,   0],
       [  0,   0, 398,   0,   2],
       [  0,   0,   0, 400,   0],
       [  0,   0,   2,   0, 398]])

You can further verify the result with accuracy_score:

y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y,y_pred_)
# 0.991

The complete standalone code is here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix,accuracy_score
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)
cm = confusion_matrix(y, y_pred)
cm
cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
# Recompute the confusion matrix with the remapped predictions
cm_ = confusion_matrix(y, y_pred_)
cm_
accuracy_score(y,y_pred_)

0
votes

You can assign each cluster the true label held by the majority of its members.
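A minimal sketch of this majority-vote idea follows. The helper name `majority_label_per_cluster` and the toy data are hypothetical; the label strings are borrowed from the question, and the sketch assumes a true label is known for every clustered sample.

```python
from collections import Counter

import numpy as np


def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, true_labels) if c == cluster]
        mapping[int(cluster)] = Counter(members).most_common(1)[0][0]
    return mapping


# Toy data (made up for illustration): cluster ids as KMeans might
# return them, next to the known label of each sample.
cluster_ids = np.array([1, 1, 1, 0, 0, 2, 2, 2, 0])
true_labels = ["AI", "AI", "AI", "VR", "VR",
               "Sistem Informasi", "Sistem Informasi", "Sistem Informasi", "AI"]

mapping = majority_label_per_cluster(cluster_ids, true_labels)
predicted_names = [mapping[int(c)] for c in cluster_ids]
print(mapping)
```

With such a mapping, the label of a new sample would be `mapping[int(model.predict(x)[0])]` rather than an index into a fixed label list. Note that a majority vote can give two clusters the same label, so it does not guarantee a one-to-one match.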


0
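votes

Albert G Lieu's solution is good and helped me a lot, but if the confusion matrix yields the same argmax for more than one column, the mapping can contain duplicate index values.

This part: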

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

should be replaced with:

cm_argmax = cm.argmax(axis=0)

# Find the duplicate value
duplicate_value = None
for value in cm_argmax:
    if np.count_nonzero(cm_argmax == value) > 1:
        duplicate_value = value
        break

# Find the missing value
missing_value = None
for i in range(len(cm_argmax)):
    if i not in cm_argmax:
        missing_value = i
        break

# Replace one of the duplicate values with the missing value at the correct index
corrected_cm_argmax = np.copy(cm_argmax)
for i, value in enumerate(cm_argmax):
    if value == duplicate_value:
        corrected_cm_argmax[i] = missing_value
        break

corrected_cm_argmax


# Use the corrected mapping, not the original one that contains the duplicate
y_pred_ = np.array([corrected_cm_argmax[i] for i in y_pred])