I have a table of people and the groups they belong to. It is formatted as follows:
person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)
I want to create a column that shows, for each person, which other people they share a group with in a given year. This is what I have so far:
library(dplyr)
example <- within(example, {
  connections <- ave(person_id, group_id, year, FUN = function(x) paste(x, collapse=','))
})
res <- example %>%
  group_by(person_id, year) %>%
  summarise(joint = paste(connections, collapse=','))
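This is close to what I need, but I don't want each row to include the person's own person_id. For example, my code creates a new column whose first value is "A1,A2,B1,A1,C2". I want this value to be "A2,B1,C2". In my example, B1 does not share a group with anyone in 2016. My code produces "B1" for that row, but I want this cell to be an empty string. How can I achieve this?
Also, the data I'm working with is very large, around 1 billion rows. Grouping twice seems inefficient, but I'm not sure whether what I want can be done without it. Is there a better way to approach this?
Note: I cannot use tidyr.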
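You can first aggregate() on group_id + year, using c or list instead of paste. Then merge() the result with the original data and use setdiff() against person_id to exclude each person from their own list. We can use replace() to set the empty cells to NA.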
> aggregate(cbind(pid=person_id) ~ group_id + year, example, c) |>
+ merge(example) |>
+ within({
+ pid <- Vectorize(setdiff)(pid, person_id)
+ pid <- replace(pid, !lengths(pid), NA_character_)
+ })
group_id year pid person_id
1 abc 2015 A2, B1 A1
2 abc 2015 A1, B1 A2
3 abc 2015 A1, A2 B1
4 abc 2016 A2 A1
5 abc 2016 A1 A2
6 cdz 2015 C2 A1
7 cdz 2015 A1 C2
8 cdz 2016 C2 A1
9 cdz 2016 A1 C2
10 fjx 2016 NA B1
11 ghe 2015 C1 A3
12 ghe 2015 A3 C1
13 ghe 2016 C1 A3
14 ghe 2016 A3 C1
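If you need exactly the shape asked for in the question (one row per person and year, a comma-separated string, and an empty string instead of NA), here is a minimal base R sketch of one way to get there. It swaps in a self-join for the Vectorize(setdiff) step above, and the names pairs, joint and person_id_other are only illustrative. Note that a self-join grows with the square of the group size, so I have not tested whether it is practical at a billion rows.

```r
# Rebuild the example data from the question.
example <- data.frame(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3",
                "B1", "B1", "C1", "C1", "C2", "C2"),
  group_id  = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe",
                "abc", "fjx", "ghe", "ghe", "cdz", "cdz"),
  year      = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016,
                2015, 2016, 2015, 2016, 2015, 2016),
  stringsAsFactors = FALSE
)

# Self-join on group and year: every pair of people sharing a group in a year.
pairs <- merge(example, example, by = c("group_id", "year"),
               suffixes = c("", "_other"))

# Drop the pairing of each person with themselves.
pairs <- pairs[pairs$person_id != pairs$person_id_other, ]

# Collapse the co-members into one string per person and year.
joint <- aggregate(person_id_other ~ person_id + year, pairs,
                   function(x) paste(sort(unique(x)), collapse = ","))
names(joint)[3] <- "joint"

# Left join back onto all person/year combinations so that people
# without any co-members are kept, then turn their NA into "".
res <- merge(unique(example[c("person_id", "year")]), joint, all.x = TRUE)
res$joint[is.na(res$joint)] <- ""
res <- res[order(res$person_id, res$year), ]
```

For A1 in 2015 this gives "A2,B1,C2", and for B1 in 2016 it gives an empty string.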
Data:
> dput(example)
structure(list(person_id = c("A1", "A1", "A1", "A1", "A2", "A2",
"A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"), group_id = c("abc",
"abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx",
"ghe", "ghe", "cdz", "cdz"), year = c(2015, 2016, 2015, 2016,
2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)), class = "data.frame", row.names = c(NA,
-14L))