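I am trying to learn the K-means clustering algorithm, and my approach is to re-implement K-means myself.
Here is the data: https://www.kaggle.com/code/agajorte/people-body-mass-index-clustering
The imagined problem is: "Given height and weight data for 500 random people (250 male, 250 female), use machine learning to group them into T-shirt sizes."
To keep things simple (I am still learning), I use only the male data and assume in advance that there are only three sizes: large, medium, and small.
Here is my code, following the K-means procedure: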
data = df[['Height','Weight']].loc['Male']
data = data.reset_index(drop = True)
data['Label'] = ''
# Pick a starting centroid [height, weight] for each cluster. The random library would also work
# (I tested that, as long as 'large' starts largest and 'small' starts smallest).
K = {'large': [180, 120],
     'medium': [170, 80],
     'small': [150, 60],
     }
# This should be a while loop, but I don't know the stop condition,
# so I use a for loop with a large number of iterations.
for loop_num in range(50):
    for i in range(len(data)):
        # a is the smallest of the three values computed against the declared centroids
        a = min(
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['large'][0]**2 + K['large'][1]**2)),
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['medium'][0]**2 + K['medium'][1]**2)),
            abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['small'][0]**2 + K['small'][1]**2))
        )
        # with a, we can label the data point
        if a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['large'][0]**2 + K['large'][1]**2)):
            data.loc[i,'Label'] = 3
        elif a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['medium'][0]**2 + K['medium'][1]**2)):
            data.loc[i,'Label'] = 2
        elif a == abs((data['Height'].loc[i]**2 + data['Weight'].loc[i]**2)-(K['small'][0]**2 + K['small'][1]**2)):
            data.loc[i,'Label'] = 1
    # Replace the centroids with better ones, using the labels
    # (for example, the mean of all points labelled 3 becomes the new 'large' centroid).
    K = {'large': [float(data.loc[data['Label'] == 3, 'Height'].mean()), float(data.loc[data['Label'] == 3, 'Weight'].mean())],
         'medium': [float(data.loc[data['Label'] == 2, 'Height'].mean()), float(data.loc[data['Label'] == 2, 'Weight'].mean())],
         'small': [float(data.loc[data['Label'] == 1, 'Height'].mean()), float(data.loc[data['Label'] == 1, 'Weight'].mean())],
         }
print(K)
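For comparison, here is a minimal sketch of the textbook assignment and update steps. It assumes the usual K-means definition, where each point goes to the centroid with the smallest Euclidean distance to that point, and it stops when no label changes between two passes. The loop above instead compares the squared norm of the point with the squared norm of each centroid. The function names `assign` and `kmeans` and the six toy points are made up for illustration; they are not taken from the dataset.

```python
import math

def assign(point, centroids):
    """Return the name of the centroid nearest to `point` (Euclidean distance)."""
    return min(centroids, key=lambda name: math.dist(point, centroids[name]))

def kmeans(points, centroids, max_iter=100):
    """Plain K-means: alternate assignment and mean update until labels stop changing."""
    centroids = {name: list(c) for name, c in centroids.items()}
    labels = None
    for _ in range(max_iter):
        new_labels = [assign(p, centroids) for p in points]
        if new_labels == labels:
            break  # stop condition: no point changed cluster
        labels = new_labels
        for name in centroids:
            members = [p for p, lab in zip(points, labels) if lab == name]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[name] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

# Hypothetical (height, weight) pairs, only to exercise the sketch
points = [(150, 55), (152, 60), (170, 80), (172, 85), (185, 120), (188, 125)]
start = {'large': [180, 120], 'medium': [170, 80], 'small': [150, 60]}

centroids, labels = kmeans(points, start)
print(centroids)
print(labels)
```

The "labels unchanged" check could replace the fixed `range(50)`; an alternative would be to stop when the centroids move less than some small threshold.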
Running it several times with different starting values, my results are consistent and make sense:
{'large': [179.60638297872342, 132.56382978723406], 'medium': [169.23076923076923, 96.38461538461539], 'small': [150.6595744680851, 75.7872340425532]}
However, I cannot get the same result using KMeans from scikit-learn:
from sklearn.cluster import KMeans

clusterer = KMeans(n_clusters=3, init='random', tol=30)
X = data[['Height','Weight']]
clusterer.fit(X)
print(clusterer.cluster_centers_)
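Every time I run it, I get a different result.
I tried changing the parameters of KMeans(). The results differ each time, and none of them make logical sense.
For example, here is the result of one run: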
[[168.85227273 141.78409091]
[170.82432432 105.01351351]
[169.44578313 69.86746988]]
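To make runs comparable, one option is to give scikit-learn the same starting centroids as the hand-written loop and fix the seed. The sketch below assumes `init` is passed as an array of shape (n_clusters, n_features), with `n_init=1` so only that single start is used, and leaves `tol` at its default. The array `X` is a made-up stand-in for `data[['Height','Weight']]`, not the real dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (height, weight) rows standing in for data[['Height', 'Weight']]
X = np.array([[150, 55], [152, 60], [170, 80],
              [172, 85], [185, 120], [188, 125]], dtype=float)

# Same starting centroids as the hand-written loop, in the order small, medium, large
init = np.array([[150, 60], [170, 80], [180, 120]], dtype=float)

# Explicit init plus a fixed random_state makes repeated runs give the same centers
clusterer = KMeans(n_clusters=3, init=init, n_init=1, random_state=0)
clusterer.fit(X)

# Sort the centers by height so their order is stable when comparing runs
order = np.argsort(clusterer.cluster_centers_[:, 0])
centers = clusterer.cluster_centers_[order]
print(centers)
```

With `init='random'` and no `random_state`, the starting centroids are drawn at random on every run, so both the centers and their row order can change from run to run.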