How do I measure the performance of K-means clustering in R? [includes image and code]

Question

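I am currently running a K-means cluster analysis on some of my company's customer data. I want to measure how well this clustering performs, but I don't know which library packages to use for that, and I'm also not sure whether my clusters are grouped too closely together.

The data fed into the clustering is a simple RFM table (recency, frequency, and monetary value). I also included each customer's average order value per transaction. I used the elbow method to determine the optimal number of clusters. The data contains 1,400 customers and 4 metric values.

An image of the cluster plot and the R code are attached as well.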

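library(dplyr)       # glimpse()
library(factoextra)  # fviz_cluster()

drop = c('CUST_Business_NM')

#Cleaning & Scaling the Data
new_cluster_data = na.omit(data)  # drop rows with missing values
# remove the name column from the NA-free data, not from the original `data`
new_cluster_data = new_cluster_data[, !(names(new_cluster_data) %in% drop)]
new_cluster_data = scale(new_cluster_data)
glimpse(new_cluster_data)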

#Elbow Method for Optimal Clusters
k.max <- 15
data <- new_cluster_data
wss <- sapply(1:k.max, 
              function(k){kmeans(data, k, nstart=50,iter.max = 15 )$tot.withinss})
#Plot out the Elbow
wss
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

#Create the Cluster
kmeans_test = kmeans(new_cluster_data, centers = 8, nstart = 1000)
View(kmeans_test$cluster)

#Visualize the Cluster
fviz_cluster(kmeans_test, data = new_cluster_data,  show.clust.cent = TRUE, geom = c("point", "text"))

Answer 1

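You probably don't want to measure the performance of the clusters so much as the performance of the clustering algorithm, in this case kmeans.

First, you need to be clear about which cluster distance measure you want to use. Cluster computations work from a dissimilarity matrix, so the choice of distance measure is critical. You can use euclidean, manhattan, any kind of correlation-based distance, or other distance measures, for example: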

library("factoextra")
dis_pearson <- get_dist(yourdataset, method = "pearson")
dis_pearson
fviz_dist(dis_pearson)

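This gives you the distance matrix and visualizes it.

The output of kmeans contains several pieces of information. The most important ones for your question are: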
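The other measures mentioned above can also be computed with base R alone. This is only a minimal sketch: the built-in USArrests table stands in for yourdataset, and the correlation-based distance is taken to be 1 minus the Pearson correlation between rows, which is a common definition but an assumption here.

```r
# Illustrative data only: USArrests stands in for yourdataset
x <- scale(USArrests)

# Euclidean and Manhattan distances between rows (observations)
dis_euclidean <- dist(x, method = "euclidean")
dis_manhattan <- dist(x, method = "manhattan")

# Correlation-based distance: 1 - Pearson correlation between rows
dis_cor <- as.dist(1 - cor(t(x), method = "pearson"))

# Each result is a "dist" object holding n * (n - 1) / 2 pairwise distances
length(dis_euclidean)
```

Note that kmeans itself takes the data matrix rather than a dist object, so these matrices are mainly useful for inspecting the data or for clustering methods that accept a dissimilarity.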

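totss: total sum of squares
withinss: vector of within-cluster sums of squares, one per cluster
tot.withinss: total within-cluster sum of squares
betweenss: between-cluster sum of squares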

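So the goal is to optimize these values by experimenting with distance measures and other ways of clustering the data. kmeans ships with base R (the stats package), so you can extract these measures simply by running mycluster <- kmeans(yourdataframe, centers = 2) and then calling mycluster.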
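The components listed above can be pulled out of the fitted object directly. Here is a minimal sketch, assuming the built-in iris measurements as a stand-in for yourdataframe, with 3 centers chosen purely for illustration:

```r
# Illustrative data only: scaled iris measurements stand in for yourdataframe
x <- scale(iris[, 1:4])

set.seed(42)  # kmeans starts from random centers
km <- kmeans(x, centers = 3, nstart = 25)

km$totss         # total sum of squares
km$withinss      # one within-cluster sum of squares per cluster
km$tot.withinss  # equals sum(km$withinss)
km$betweenss     # equals km$totss - km$tot.withinss

# Share of the total sum of squares explained by the partition (0 to 1)
km$betweenss / km$totss
```

A higher ratio for the same number of clusters suggests a tighter partition, but the ratio always grows as centers are added, so it is only meaningful when comparing runs with the same k.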

Side note: kmeans requires a user-defined number of clusters (which takes extra effort) and is very sensitive to outliers.
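One way to reduce that extra effort is to compare candidate values of k with the average silhouette width. This is not something the answer itself proposes; it is a sketch that assumes the silhouette() function from the cluster package, iris as placeholder data, and an arbitrary range of k from 2 to 6.

```r
library(cluster)  # silhouette()

# Illustrative data only
x <- scale(iris[, 1:4])
d <- dist(x)  # Euclidean distances, matching what kmeans optimizes

set.seed(42)
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})

# Average silhouette width per k; values closer to 1 indicate
# tighter, better-separated clusters
data.frame(k = ks, avg_sil_width = avg_sil)
```

Unlike the within-cluster sum of squares, the average silhouette width does not improve automatically as k increases, which makes it a useful complement to the elbow plot.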
