计算中的R簇之间的总平方和

问题描述 投票:0回答:2

我的目标是要比较这两个聚类方法我用cluster_method_1cluster_method_2有方格簇之和最大,以确定哪一个取得了较好的分离。

我基本上在寻找一个有效的方法来计算集群1的每个点和集群2,3,4各点之间的距离,等等。

例如数据帧:

structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848, 
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376, 
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849, 
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849, 
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684, 
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684, 
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328, 
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861, 
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L, 
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4", 
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L, 
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4", 
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))



        x1     x2     x3     x4 cluster_method_1 cluster_method_2
     <dbl>  <dbl>  <dbl>  <dbl> <fct>            <fct>           
 1  0.0176  0.536  0.647 -0.727 1                5               
 2 -1.15    0.886 -0.604  0.727 4                3               
 3  1.07   -0.945 -0.604  0.727 4                1               
 4  2.00   -2.24  -0.604 -0.727 4                3               
 5  0.0176 -1.81  -0.604 -0.727 2                4               
 6  0.944   1.83   0.647  0.727 2                2               
 7  2.00    0.886 -0.604  0.727 4                1               
 8  0.0176  0.969 -0.604 -0.727 2                1               
 9 -1.15    0.187 -0.604 -0.727 1                1               
10 -1.15   -0.679  1.62  -0.727 6                6  
r cluster-analysis
2个回答
1
投票

平方的总和,sum_x sum_y || X-Y ||²是恒定的。

平方的总和可以从平凡方差来计算。

如果你现在减去平方的范围内集群和其中x和y属于同一个集群,然后广场群之间总和保持。

如果你这样做的方式,它需要O(n)的时间,而不是O(N²)。

推论:用最小的WCSS的解决方案具有最大的BCSS。


3
投票

的(欧几里得)距离的平方,由点的数量的两倍该群集中分割求和的平方为群集的Si可以被写为所有成对的总和内(参见例如the Wikipedia article on k-means clustering

enter image description here

为了方便起见,我们定义一个函数calc_SS返回内的求和的平方为一个(数字)data.frame

calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

这是直截了当然后向(簇)求和的平方内计算用于每个方法每个集群

library(tidyverse)
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    spread(method, within_SS)
## A tibble: 6 x 3
#  cluster cluster_method_1 cluster_method_2
#  <chr>              <dbl>            <dbl>
#1 1                   1.52             9.99
#2 2                  10.3              0
#3 3                  NA               10.9
#4 4                  15.2              0
#5 5                  NA                0
#6 6                   0                0

加总法格内的总计是那么刚刚的内求和的平方,每簇总和

df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    group_by(method) %>%
    summarise(total_within_SS = sum(within_SS)) %>%
    spread(method, total_within_SS)
## A tibble: 1 x 2
#  cluster_method_1 cluster_method_2
#             <dbl>            <dbl>
#1             27.0             20.9 

顺便说一句,我们可以确认calc_SS确实内加总平方使用iris数据集返回:

set.seed(2018)
df2 <- iris[, 1:4]
kmeans <- kmeans(as.matrix(df2), 3)
df2$cluster <- kmeans$cluster

df2 %>%
    group_by(cluster) %>%
    nest() %>%
    mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    arrange(cluster)
## A tibble: 3 x 3
#  cluster data              within_SS
#    <int> <list>                <dbl>
#1       1 <tibble [38 × 4]>      23.9
#2       2 <tibble [62 × 4]>      39.8
#3       3 <tibble [50 × 4]>      15.2

kmeans$within
#[1] 23.87947 39.82097 15.15100
© www.soinside.com 2019 - 2024. All rights reserved.