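My goal is to compare the two clustering methods I have used, cluster_method_1 and cluster_method_2, and find out which has the larger between-cluster sum of squares, in order to identify which one achieves better separation.
I'm basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.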
Example data frame:
df <- structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848,
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376,
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849,
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849,
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684,
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684,
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328,
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861,
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L,
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4",
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L,
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4",
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
        x1     x2     x3     x4 cluster_method_1 cluster_method_2
     <dbl>  <dbl>  <dbl>  <dbl> <fct>            <fct>
 1  0.0176  0.536  0.647 -0.727 1                5
 2   -1.15  0.886 -0.604  0.727 4                3
 3    1.07 -0.945 -0.604  0.727 4                1
 4    2.00  -2.24 -0.604 -0.727 4                3
 5  0.0176  -1.81 -0.604 -0.727 2                4
 6   0.944   1.83  0.647  0.727 2                2
 7    2.00  0.886 -0.604  0.727 4                1
 8  0.0176  0.969 -0.604 -0.727 2                1
 9   -1.15  0.187 -0.604 -0.727 1                1
10   -1.15 -0.679   1.62 -0.727 6                6
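The total sum of squares, sum_x sum_y ||x - y||², is constant for a given data set; it does not depend on the clustering.
The total sum of squares can be computed trivially from the variance.
If you now subtract the within-cluster sum of squares (the terms where x and y belong to the same cluster), the between-cluster sum of squares remains.
Done this way, it takes O(n) time instead of O(n²).
Corollary: the solution with the smallest WCSS has the largest BCSS.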
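As a rough illustration of the point above, here is a minimal sketch in base R. The helper names ss_about_mean and between_ss are mine, not from the answer, and I use the sum of squared deviations from the mean as the "total sum of squares" (the pairwise sum is 2n times that, so the decomposition is the same up to scaling). It only touches each point a constant number of times and never builds a distance matrix.

```r
# Sum of squared deviations from the column means (hypothetical helper).
ss_about_mean <- function(m) {
  m <- as.matrix(m)
  sum(sweep(m, 2, colMeans(m))^2)
}

# Between-cluster SS as total SS minus within-cluster SS (hypothetical helper).
between_ss <- function(data, cluster) {
  data  <- as.matrix(data)
  total <- ss_about_mean(data)
  # Row indices per cluster; drop = TRUE skips unused factor levels.
  groups <- split(seq_len(nrow(data)), cluster, drop = TRUE)
  within <- sum(sapply(groups, function(idx) {
    ss_about_mean(data[idx, , drop = FALSE])
  }))
  total - within
}

# Sanity check against the value reported by stats::kmeans on iris.
set.seed(2018)
x  <- iris[, 1:4]
km <- kmeans(as.matrix(x), 3)
all.equal(between_ss(x, km$cluster), km$betweenss)
```

Calling between_ss once per clustering column and comparing the two numbers would then answer the question directly.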
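The within sum of squares for cluster Si can be written as the sum of all pairwise squared (Euclidean) distances within that cluster, divided by twice the number of points in the cluster (see e.g. the Wikipedia article on k-means clustering).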
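Written out, and assuming \mu_i denotes the centroid of cluster S_i (a symbol I am introducing here, it is not used in the answer), the identity reads as follows, with the double sum running over all ordered pairs of points in the cluster:

```latex
\sum_{x \in S_i} \lVert x - \mu_i \rVert^2
  = \frac{1}{2\,\lvert S_i \rvert}
    \sum_{x \in S_i} \sum_{y \in S_i} \lVert x - y \rVert^2
```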
For convenience, we define a function calc_SS that returns the within sum of squares for a (numeric) data.frame:
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))
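A quick way to convince yourself that the pairwise formula is the usual within sum of squares is to compare it with the sum of squared deviations from the centroid. This is only a sketch: centroid_SS and the toy data are made up for illustration, while calc_SS is copied from the line above.

```r
# calc_SS as defined in the answer: pairwise squared distances / (2 * n).
calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Hypothetical reference implementation: squared deviations from the centroid.
centroid_SS <- function(df) {
  m <- as.matrix(df)
  sum(sweep(m, 2, colMeans(m))^2)
}

# Toy data, purely illustrative.
set.seed(1)
toy <- data.frame(a = rnorm(5), b = rnorm(5))

# The two results should agree up to floating-point error.
all.equal(calc_SS(toy), centroid_SS(toy))
```

The centroid form is linear in the number of points, whereas dist() builds all n(n-1)/2 pairwise distances, so for large clusters the centroid form is probably the better choice.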
It is then straightforward to calculate the within (cluster) sum of squares for every cluster and every method:
library(tidyverse)
df %>%
gather(method, cluster, cluster_method_1, cluster_method_2) %>%
group_by(method, cluster) %>%
nest() %>%
transmute(
method,
cluster,
within_SS = map_dbl(data, ~calc_SS(.x))) %>%
spread(method, within_SS)
## A tibble: 6 x 3
# cluster cluster_method_1 cluster_method_2
# <chr> <dbl> <dbl>
#1 1 1.52 9.99
#2 2 10.3 0
#3 3 NA 10.9
#4 4 15.2 0
#5 5 NA 0
#6 6 0 0
The total within sum of squares is then just the sum of the within sums of squares over all clusters:
df %>%
gather(method, cluster, cluster_method_1, cluster_method_2) %>%
group_by(method, cluster) %>%
nest() %>%
transmute(
method,
cluster,
within_SS = map_dbl(data, ~calc_SS(.x))) %>%
group_by(method) %>%
summarise(total_within_SS = sum(within_SS)) %>%
spread(method, total_within_SS)
## A tibble: 1 x 2
# cluster_method_1 cluster_method_2
# <dbl> <dbl>
#1 27.0 20.9
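To get from these totals back to the between-cluster sum of squares the question asks about, one option is to subtract them from the total sum of squares of the whole data set. The following is a base-R sketch under that assumption; total_within, m1 and m2 are names I made up, the numeric values are copied from the dput() output in the question, and calc_SS is the function defined earlier.

```r
# Example data from the question (values copied from the dput() output).
x <- data.frame(
  x1 = c(0.01762376, -1.147739752, 1.073605848, 2.000420899, 0.01762376,
         0.944438811, 2.000420899, 0.01762376, -1.147739752, -1.147739752),
  x2 = c(0.536193126, 0.885609849, -0.944699546, -2.242627057, -1.809984553,
         1.834120637, 0.885609849, 0.96883563, 0.186776403, -0.678508604),
  x3 = c(0.64707104, -0.603759684, -0.603759684, -0.603759684, -0.603759684,
         0.64707104, -0.603759684, -0.603759684, -0.603759684, 1.617857394),
  x4 = c(-0.72712328, 0.72730861, 0.72730861, -0.72712328, -0.72712328,
         0.72730861, 0.72730861, -0.72712328, -0.72712328, -0.72712328)
)
m1 <- c(1, 4, 4, 4, 2, 2, 4, 2, 1, 6)  # cluster_method_1 labels
m2 <- c(5, 3, 1, 3, 4, 2, 1, 1, 1, 6)  # cluster_method_2 labels

calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

# Total within SS: split the rows by cluster label and sum calc_SS per group.
total_within <- function(data, cluster) {
  sum(sapply(split(data, cluster), calc_SS))
}

total_SS <- calc_SS(x)  # all points treated as a single cluster
within   <- c(method_1 = total_within(x, m1),
              method_2 = total_within(x, m2))
between  <- total_SS - within

between
names(which.max(between))  # method with the larger between-cluster SS
```

Since total_SS is the same for both methods, the method with the smaller total within sum of squares is necessarily the one with the larger between-cluster sum of squares.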
By the way, we can confirm that calc_SS does indeed return the within sum of squares, using the iris dataset:
set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster
df2 %>%
group_by(cluster) %>%
nest() %>%
mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
arrange(cluster)
## A tibble: 3 x 3
# cluster data within_SS
# <int> <list> <dbl>
#1 1 <tibble [38 × 4]> 23.9
#2 2 <tibble [62 × 4]> 39.8
#3 3 <tibble [50 × 4]> 15.2
kmeans$withinss
#[1] 23.87947 39.82097 15.15100