I have several algorithms that each classify individuals into group A, B, C, or D.
The dataset of predicted classes looks like this (a mess!):
# sample data
df_orig = tibble(
Individuals = c(1, 2, 3, 4, 5, 6, 7),
Algorithm_1 = c("A", "B", "A", "C", "A", "A", "D"),
Algorithm_2 = c("B", "C", "B", "D", "C", "A", "D"),
Algorithm_3 = c("C", "D", "D", "B", "D", "B", "A"),
Algorithm_4 = c("D", "B", "C", "A", "B", "A", "A")
)
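The groups overlap only loosely. I would now like to know which classes have the largest overlap/intersection of individuals across the algorithms.
The desired output might look something like this (purely fictional example values):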
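| | Algorithm_1 | Algorithm_2 | Algorithm_3 | Algorithm_4 |
|---|---|---|---|---|
| Best fit 1 | Class_A | Class_B | Class_D | Class_C |
| Best fit 2 | Class_C | Class_A | Class_B | Class_D |
| Best fit 3 | Class_B | Class_B | Class_D | Class_A |
| Best fit 3 | Class_D | Class_C | Class_D | Class_B |
| ... | ... | ... | ... | ... |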
So far I have tried a set-theoretic approach with Reduce(intersect()), but Reduce does not seem to work well on nested dplyr:: structures (or I am using it incorrectly):
df_test <- df_orig %>%
rownames_to_column() %>%
rename(individual = rowname) %>%
mutate(individual = individual %>% as.numeric) %>%
pivot_longer(starts_with("A"), names_to="algorithm", values_to = "prediction") %>%
pivot_wider(values_from = individual, names_from = individual, names_prefix = "I_") %>%
nest(individuals = starts_with("I")) %>% mutate(individuals = lapply(`individuals`, function(x) x[!is.na(x)]))
Reduce(dplyr::intersect, df_test$individuals)
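which only yields numeric(0); the split nested structure does not account for overlap that is conditional on both the algorithm and the predicted class.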
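For illustration only, here is a minimal base-R sketch of why intersecting everything comes back empty, and of what a conditional, pairwise overlap could look like instead. It assumes the sample data above rebuilt as a plain data.frame; the names `df`, `algos`, `sets` and `overlap` are hypothetical helpers and not part of the attempt above.

```r
# Sample predictions from the question, as a base data.frame (assumption:
# same values as df_orig, no tidyverse required)
df <- data.frame(
  Individuals = 1:7,
  Algorithm_1 = c("A", "B", "A", "C", "A", "A", "D"),
  Algorithm_2 = c("B", "C", "B", "D", "C", "A", "D"),
  Algorithm_3 = c("C", "D", "D", "B", "D", "B", "A"),
  Algorithm_4 = c("D", "B", "C", "A", "B", "A", "A")
)
algos <- setdiff(names(df), "Individuals")

# One set of individuals per (algorithm, class) combination,
# named like "Algorithm_1:A"
sets <- do.call(c, lapply(algos, function(a) {
  split(df$Individuals, paste(a, df[[a]], sep = ":"))
}))

# Within one algorithm the class sets are disjoint, so intersecting
# all sets at once can only be empty
Reduce(intersect, sets)

# Pairwise overlap: number of individuals shared by every pair of sets
overlap <- sapply(sets, function(x) {
  sapply(sets, function(y) length(intersect(x, y)))
})

# Example lookup: individuals that Algorithm_1 puts in A and Algorithm_2 in B
overlap["Algorithm_1:A", "Algorithm_2:B"]
```

The pairwise matrix only compares two (algorithm, class) sets at a time, so it is a starting point rather than a full multi-set solution.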
I have also considered a correlation-based approach, for example:
# compute correlation
prediction_matrix <- model.matrix(~0+., data=df_orig %>% select(-Individuals)) %>%
cor(use="pairwise.complete.obs")
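However, this gives me the best pairwise correlations, not the best multi-set across several classes.
I am a bit lost and hope someone clever can help.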
Starting from your example data df_orig, you could:
count_matches <- function(xs) unlist(Map(seq_along(xs), f = \(i) sum(xs[i] == xs[-i])))
## > count_matches(c('A', 'B', 'A'))
## [1] 1 0 1
library(tidyverse) ## attaches dplyr and tidyr, both used below
df_orig |>
pivot_longer(-Individuals, names_to = 'Algorithm', values_to = 'Class') |>
mutate(cnt = count_matches(Class), .by = Individuals) |>
summarise(cnt = sum(cnt), .by = c(Algorithm, Class)) |>
arrange(Algorithm, desc(cnt)) |>
mutate(Rank = row_number(),
Algorithm = gsub('lgorithm', '', Algorithm), ## shorten column labels
.by = Algorithm) |>
pivot_wider(names_from = Algorithm, values_from = c(Class, cnt),
names_vary = 'slowest'
)
Output:
## # A tibble: 4 x 9
## Rank Class_A_1 cnt_A_1 Class_A_2 cnt_A_2 Class_A_3 cnt_A_3 Class_A_4 cnt_A_4
## <int> <chr> <int> <chr> <int> <chr> <int> <chr> <int>
## 1 1 A 2 A 2 A 1 A 3
## 2 2 B 1 D 1 C 0 B 1
## 3 3 D 1 B 0 D 0 D 0
## 4 4 C 0 C 0 B 0 C 0
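A(lgorithm) 1 shows the highest agreement (with the other algorithms) for class A: it shares that classification with other algorithms in 2 instances (summed across individuals and algorithms), and so on.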
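As a cross-check, the same totals can be reproduced in base R. This is only a sketch under assumptions: it rebuilds the sample data as a plain data.frame, assumes there are no NA predictions, and uses a hypothetical table-based helper `count_matches_tab` in place of `count_matches`.

```r
# Sample predictions from the question, as a base data.frame
df <- data.frame(
  Individuals = 1:7,
  Algorithm_1 = c("A", "B", "A", "C", "A", "A", "D"),
  Algorithm_2 = c("B", "C", "B", "D", "C", "A", "D"),
  Algorithm_3 = c("C", "D", "D", "B", "D", "B", "A"),
  Algorithm_4 = c("D", "B", "C", "A", "B", "A", "A")
)
algos <- setdiff(names(df), "Individuals")
preds <- as.matrix(df[algos])  # character matrix: individuals x algorithms

# Hypothetical helper: for each element, how many OTHER elements are equal?
# (frequency of the element's value minus the element itself)
count_matches_tab <- function(xs) as.vector(table(xs)[xs]) - 1L

# Per individual (row): agreement of each algorithm with the other algorithms
agree <- t(apply(preds, 1, count_matches_tab))
colnames(agree) <- algos

# Sum the agreement counts per (class, algorithm) combination
totals <- sapply(algos, function(a) tapply(agree[, a], preds[, a], sum))
totals
```

Rows of `totals` are classes and columns are algorithms; sorting each column in decreasing order should give the same ranking as the table above.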