Self-implemented K-means vs. K-means in scikit-learn


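I'm trying to learn the K-means clustering algorithm, and my way of learning is to rebuild K-means myself.

Here is the data: https://www.kaggle.com/code/agajorte/people-body-mass-index-clustering

This is the imagined problem: "Given height and weight data for 500 random people (250 male, 250 female), use machine learning to group them into T-shirt sizes."

For simplicity (since I'm still learning), I stick to the male data only and assume in advance that there are only three sizes: large, medium, and small.

Here is my code, following the K-means procedure: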

from pandas import DataFrame

# df holds the Kaggle data loaded earlier (its index is assumed to be the gender column)
data = df[['Height','Weight']].loc['Male']
data = data.reset_index(drop = True)
data['Label'] = ''

# Pick starting centroids for each cluster. You could also use the random library
# (I already tested that; it works as long as large is the largest and small is the smallest).

K = {'large': [180,120],
     'medium': [170, 80],
     'small': [150,60],
     }

# This should be a while loop, but I don't know the stop condition, so I just use a for loop with a large number of iterations.
for loop_num in range(50):


    for i in range(len(data)):
        # a is the smallest of the three values computed for point i, one per declared centroid
        a = min(
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['large'][0]**2 + K['large'][1]**2)),
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['medium'][0]**2 + K['medium'][1]**2)),
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['small'][0]**2 + K['small'][1]**2))
        )

        # With a, we can label the data point

        if a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['large'][0]**2 + K['large'][1]**2)):
            data.loc[i,'Label'] = 3
        elif a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['medium'][0]**2 + K['medium'][1]**2)):
            data.loc[i,'Label'] = 2
        elif a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['small'][0]**2 + K['small'][1]**2)):
            data.loc[i,'Label'] = 1

    # Replace the centroids with better ones using the labels (for example, the mean of all points labeled 3 becomes the new centroid for 'large')

    K = {'large': [float(DataFrame.mean(data.loc[data['Label'] == 3, ['Height']])),float(DataFrame.mean(data.loc[data['Label'] == 3, ['Weight']]))],
                    'medium': [float(DataFrame.mean(data.loc[data['Label'] == 2, ['Height']])),float(DataFrame.mean(data.loc[data['Label'] == 2, ['Weight']]))],
                    'small': [float(DataFrame.mean(data.loc[data['Label'] == 1, ['Height']])),float(DataFrame.mean(data.loc[data['Label'] == 1, ['Weight']]))],
                    }
print(K)

Running it several times with different values here and there, my results are consistent and logical:

{'large': [179.60638297872342, 132.56382978723406], 'medium': [169.23076923076923, 96.38461538461539], 'small': [150.6595744680851, 75.7872340425532]}
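
For comparison, textbook K-means assigns each point to the centroid with the smallest Euclidean distance, which is not the same quantity as the difference of squared norms that the loop above compares. A common stop condition is to quit once no centroid moves by more than a small tolerance. Below is a minimal sketch of both ideas in plain Python. The helper names (`assign_label`, `update_centroids`, `kmeans`), the `tol` default, and the toy points are assumptions for illustration, not part of the original code or data:

```python
import math

def assign_label(point, centroids):
    """Return the key of the centroid nearest to `point` by Euclidean distance."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of its assigned points (keep it if it has none)."""
    new = {}
    for name, old in centroids.items():
        members = [p for p, lab in zip(points, labels) if lab == name]
        if members:
            new[name] = [sum(coord) / len(members) for coord in zip(*members)]
        else:
            new[name] = list(old)
    return new

def kmeans(points, centroids, tol=1e-6, max_iter=100):
    """Alternate assignment and update until no centroid moves more than `tol`."""
    for _ in range(max_iter):
        labels = [assign_label(p, centroids) for p in points]
        new = update_centroids(points, labels, centroids)
        # Largest distance any centroid moved in this iteration
        shift = max(math.dist(centroids[k], new[k]) for k in centroids)
        centroids = new
        if shift <= tol:
            break
    return centroids, labels

# Toy (height, weight) points made up for illustration, not the Kaggle data
toy_points = [(150, 55), (152, 60), (170, 80), (172, 85), (185, 120), (188, 125)]
start = {'large': [180, 120], 'medium': [170, 80], 'small': [150, 60]}
final_centroids, final_labels = kmeans(toy_points, start)
print(final_centroids)
print(final_labels)
```

With a tolerance-based exit like this, the fixed `range(50)` is no longer needed; `max_iter` only acts as a safety cap.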

However, I can't get the same results using KMeans from scikit-learn:

from sklearn.cluster import KMeans

clusterer = KMeans(n_clusters=3,init='random',tol = 30)
X = data[['Height','Weight']]
clusterer.fit(X)
print(clusterer.cluster_centers_)

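Every time I run it, I get a different result.

I tried changing the parameters of KMeans(). I get different results every time, and none of them make logical sense.

For example, here is the result of one run: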

[[168.85227273 141.78409091]
 [170.82432432 105.01351351]
 [169.44578313  69.86746988]]
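
One way to make the scikit-learn run comparable to a hand-written one is to give it the same starting centroids, or at least to pin the random seed so reruns agree. Below is a minimal sketch, assuming NumPy is available. The toy array is made up for illustration and is not the Kaggle data. The default `tol` is kept here, since a value as large as 30 may end the iterations early:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy (height, weight) rows made up for illustration, not the Kaggle data
X = np.array([[150, 55], [152, 60], [170, 80],
              [172, 85], [185, 120], [188, 125]], dtype=float)

# Same starting centroids as the hand-written version: large, medium, small
init_centroids = np.array([[180, 120], [170, 80], [150, 60]], dtype=float)

# n_init=1 because an explicit init array leaves nothing to restart from
clusterer = KMeans(n_clusters=3, init=init_centroids, n_init=1)
clusterer.fit(X)
print(clusterer.cluster_centers_)

# Alternatively, keep random initialization but pin the seed so reruns agree
seeded_a = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
seeded_b = KMeans(n_clusters=3, init='random', n_init=10, random_state=0).fit(X)
print(np.allclose(seeded_a.cluster_centers_, seeded_b.cluster_centers_))
```

With an explicit `init` array, the rows of `cluster_centers_` follow the order of the starting centroids, so row 0 corresponds to 'large', row 1 to 'medium', and row 2 to 'small'.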
python machine-learning k-means