I have more of a general question. I have a data frame like the one below, consisting of ids that can check out multiple items (at least 3, at most 6).
id item_1 item_2 item_3 item_4 item_5 item_6
1 13103802 13060661 13339404 12896842 13308823 NA
2 448361 497992 13103802* 13002842 NA NA
3 13031560 13103802* 13268709 2139908 1954965 12930979
4 13060661* 13339404* 446881 13406902 NA NA
5 12980231 12980231 12980231 NA NA NA
6 12896842* 13339404* 12717215 444032 13308823* NA
7 2098716 449342 13339070 12993196 2649922 NA
8 2678151 12700906 12903744 2623298 12736032 349511
9 2501765 2534504 2629353 NA NA NA
10 12955428 12766447 12944593 NA NA NA
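Now, for each id, I want to count how many other ids share 1 to 6 of the same items. In the end I want to add 6 more columns containing the number of rows that share at least 1 item, at least 2 items, and so on.
So, based on the data above, for the first row the "1 item" column would be 4, because it shares at least one item with rows 2, 3, 4 and 6; the "2 items" column would be 2, because it shares at least 2 items with rows 4 and 6; and the "3 items" column would be 1, because it shares at least 3 items with row 6, and so on. (I marked with * the values in the other rows that are shared with the first row, to make this more obvious.)
I'm not sure how to approach this. Can anyone help?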
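You can use table() to count how often each item value occurs across the item columns, and then apply rowSums() to count, per row, how many of its items also occur elsewhere: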
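library(dplyr)
# Frequency of every item value across all item columns (NAs are dropped)
tab <- table(unlist(df[-1]))
df <-
  df %>%
  # count_k: how many other cells hold the same value as item_k
  mutate(across(contains("item"), ~ tab[match(.x, names(tab))] - 1,
                .names = "count{gsub('item', '', col)}"),
         # count: number of items in the row that also occur elsewhere
         count = rowSums(across(contains("count")) > 0, na.rm = TRUE))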
id item_1 item_2 item_3 item_4 item_5 item_6 count_1 count_2 count_3 count_4 count_5 count_6 count
1 1 13103802 13060661 13339404 12896842 13308823 NA 2 1 2 1 1 NA 5
2 2 448361 497992 13103802 13002842 NA NA 0 0 2 0 NA NA 1
3 3 13031560 13103802 13268709 2139908 1954965 12930979 0 2 0 0 0 0 1
4 4 13060661 13339404 446881 13406902 NA NA 1 2 0 0 NA NA 2
5 5 12980231 12980231 12980231 NA NA NA 2 2 2 NA NA NA 3
6 6 12896842 13339404 12717215 444032 13308823 NA 1 2 0 0 1 NA 3
7 7 2098716 449342 13339070 12993196 2649922 NA 0 0 0 0 0 NA 0
8 8 2678151 12700906 12903744 2623298 12736032 349511 0 0 0 0 0 0 0
9 9 2501765 2534504 2629353 NA NA NA 0 0 0 NA NA NA 0
10 10 12955428 12766447 12944593 NA NA NA 0 0 0 NA NA NA 0
Then perhaps call table() on count:
table(df$count)
#0 1 2 3 5
#4 2 1 2 1
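To make the lookup in the answer above easier to follow, here is a minimal sketch of the expression tab[match(.x, names(tab))] - 1 applied to a toy vector. The vector x and the name others are made up for illustration and are not part of the original answer. Note that this counts occurrences of a value anywhere in the data, including repeats within the same row (which is why row 5 shows 2, 2, 2 above), so it is not the same quantity as the number of other rows sharing an item.

```r
# Hypothetical toy vector standing in for the unlisted item columns.
x <- c(10L, 20L, 10L, NA, 30L)

# table() drops NA by default; names(tab) are the values as character strings.
tab <- table(x)

# match() coerces x to character to find each value's position in names(tab);
# subtracting 1 removes the cell's own occurrence.
others <- as.integer(tab[match(x, names(tab))]) - 1L
others
```

Here the two 10s each see one other occurrence, 20 and 30 see none, and the NA stays NA.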
A somewhat messy and convoluted tidyverse approach, but you could give it a try. Let me know if this is close to what you need.
library(tidyverse)
df %>%
pivot_longer(-id, names_pattern = "(\\d+)$") %>%
filter(!is.na(value)) %>%
mutate(n = n_distinct(id), ids = list(unique(id)), .by = value) %>%
unnest(ids) %>%
filter(id != ids) %>%
reframe(freq = as.numeric(table(ids)), .by = id) %>%
right_join(expand_grid(id = df$id, col = 1:(ncol(df)-1)), by = "id", multiple = "all") %>%
replace_na(list(freq = 0)) %>%
reframe(value = sum(freq >= col), .by = c("id", "col")) %>%
pivot_wider(id_cols = id, names_from = col, values_from = value, names_prefix = "count") %>%
arrange(id)
Output
id count1 count2 count3 count4 count5 count6
<int> <int> <int> <int> <int> <int> <int>
1 1 4 2 1 0 0 0
2 2 2 0 0 0 0 0
3 3 2 0 0 0 0 0
4 4 2 1 0 0 0 0
5 5 0 0 0 0 0 0
6 6 2 1 1 0 0 0
7 7 0 0 0 0 0 0
8 8 0 0 0 0 0 0
9 9 0 0 0 0 0 0
10 10 0 0 0 0 0 0
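As a cross-check on the counts above, the same "at least k shared items" logic can be sketched in base R by comparing the set of items in every pair of rows. This is only an illustrative alternative, not the answerer's method. The names item_sets, shared, counts and result are assumptions introduced here, and the pairwise comparison is quadratic in the number of rows, so it is meant for small data like this sample.

```r
# Sample data copied from the Data section so the sketch runs on its own.
df <- data.frame(
  id = 1:10,
  item_1 = c(13103802L, 448361L, 13031560L, 13060661L, 12980231L,
             12896842L, 2098716L, 2678151L, 2501765L, 12955428L),
  item_2 = c(13060661L, 497992L, 13103802L, 13339404L, 12980231L,
             13339404L, 449342L, 12700906L, 2534504L, 12766447L),
  item_3 = c(13339404L, 13103802L, 13268709L, 446881L, 12980231L,
             12717215L, 13339070L, 12903744L, 2629353L, 12944593L),
  item_4 = c(12896842L, 13002842L, 2139908L, 13406902L, NA,
             444032L, 12993196L, 2623298L, NA, NA),
  item_5 = c(13308823L, NA, 1954965L, NA, NA,
             13308823L, 2649922L, 12736032L, NA, NA),
  item_6 = c(NA, NA, 12930979L, NA, NA, NA, NA, 349511L, NA, NA)
)

items <- as.matrix(df[-1])

# Distinct non-missing items per row, so a value repeated within
# one row (as for id 5) is counted only once.
item_sets <- lapply(seq_len(nrow(items)), function(i) {
  x <- items[i, ]
  unique(x[!is.na(x)])
})

# shared[i, j] = number of distinct items rows i and j have in common.
n <- length(item_sets)
shared <- outer(seq_len(n), seq_len(n), Vectorize(function(i, j) {
  length(intersect(item_sets[[i]], item_sets[[j]]))
}))
diag(shared) <- 0  # a row is not compared with itself

# counts[i, k] = number of other rows sharing at least k items with row i.
counts <- sapply(seq_len(ncol(items)), function(k) rowSums(shared >= k))
colnames(counts) <- paste0("count", seq_len(ncol(items)))

result <- cbind(df["id"], counts)
result
```

For the first row this gives 4, 2, 1, 0, 0, 0, which matches both the example in the question and the tidyverse output above.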
Data
df <- structure(list(id = 1:10, item_1 = c(13103802L, 448361L, 13031560L,
13060661L, 12980231L, 12896842L, 2098716L, 2678151L, 2501765L,
12955428L), item_2 = c(13060661L, 497992L, 13103802L, 13339404L,
12980231L, 13339404L, 449342L, 12700906L, 2534504L, 12766447L
), item_3 = c(13339404L, 13103802L, 13268709L, 446881L, 12980231L,
12717215L, 13339070L, 12903744L, 2629353L, 12944593L), item_4 = c(12896842L,
13002842L, 2139908L, 13406902L, NA, 444032L, 12993196L, 2623298L,
NA, NA), item_5 = c(13308823L, NA, 1954965L, NA, NA, 13308823L,
2649922L, 12736032L, NA, NA), item_6 = c(NA, NA, 12930979L, NA,
NA, NA, NA, 349511L, NA, NA)), class = "data.frame", row.names = c(NA,
-10L))