How to group and concatenate strings in R while skipping one?


I have a table of people and the groups they belong to. It is formatted as follows:

person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)

I want to create a column that shows, for each person, which other people they share a group with in a given year. This is what I have so far:

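library(dplyr)  # needed for %>%, group_by() and summarise()

example <- within(example, {
   # all members of each group in each year, joined into one string
   connections <- ave(person_id, group_id, year, FUN = function(x) paste(x, collapse=','))
})
res <- example %>%
   group_by(person_id, year) %>%
   summarise(joint = paste(connections, collapse=','))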

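This is close to what I need, but I don't want each row to include its own person_id. For example, my code creates a new column whose first value is "A1,A2,B1,A1,C2". I want this value to be "A2,B1,C2". In my example, B1 does not share a group with anyone in 2016. My code produces the value "B1" for that row, but I want this cell to be an empty string. How can I achieve this?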

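Also, the data I am working with is very large, around 1 billion rows. Grouping twice seems inefficient, but I am not sure whether what I want can be done without it. Is there a better way to approach this?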

Note: I cannot use tidyr.

1 Answer

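You can first aggregate on group_id + year, using c to collect the members into a list rather than paste. Then merge the result with the original data and use setdiff to exclude each row's own person_id. We can use replace to set the resulting empty cells to NA.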

> aggregate(cbind(pid=person_id) ~ group_id + year, example, c) |> 
+   merge(example) |> 
+   within({
+     pid <- Vectorize(setdiff)(pid, person_id)
+     pid <- replace(pid, !lengths(pid), NA_character_)
+   })
   group_id year    pid person_id
1       abc 2015 A2, B1        A1
2       abc 2015 A1, B1        A2
3       abc 2015 A1, A2        B1
4       abc 2016     A2        A1
5       abc 2016     A1        A2
6       cdz 2015     C2        A1
7       cdz 2015     A1        C2
8       cdz 2016     C2        A1
9       cdz 2016     A1        C2
10      fjx 2016     NA        B1
11      ghe 2015     C1        A3
12      ghe 2015     A3        C1
13      ghe 2016     C1        A3
14      ghe 2016     A3        C1
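
The output above has one row per person, group and year and uses NA for empty cells, whereas the question asks for one row per person and year with an empty string. As a rough sketch only (not part of the original answer), the same idea can be expressed in base R as a self-join followed by a second aggregate. The names others, other_id, pairs, joint and res are introduced here for illustration. Note that the self-join creates one row per pair of people in a group, so it may not be practical at the scale mentioned in the question.

# Example data from the question (stringsAsFactors = FALSE keeps IDs as character)
example <- data.frame(
  person_id = c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"),
  group_id  = c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz"),
  year      = c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016),
  stringsAsFactors = FALSE
)

# Copy of the data with person_id renamed, so the self-join keeps both IDs
others <- example
names(others)[names(others) == "person_id"] <- "other_id"

# Self-join on group and year: every pair of people sharing a group in a year
pairs <- merge(example, others, by = c("group_id", "year"))
# Drop the pairs that match a person with themselves
pairs <- pairs[pairs$person_id != pairs$other_id, ]

# Collapse the other members into one string per person and year
joint <- aggregate(other_id ~ person_id + year, pairs,
                   function(x) paste(sort(unique(x)), collapse = ","))
names(joint)[names(joint) == "other_id"] <- "joint"

# Restore person/year combinations with no connections as empty strings
res <- merge(unique(example[c("person_id", "year")]), joint, all.x = TRUE)
res$joint[is.na(res$joint)] <- ""
res <- res[order(res$person_id, res$year), ]

With the example data, res has one row for each of the 12 person/year combinations; A1 in 2015 gets "A2,B1,C2" and B1 in 2016 gets an empty string.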

Data:

> dput(example)
structure(list(person_id = c("A1", "A1", "A1", "A1", "A2", "A2", 
"A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2"), group_id = c("abc", 
"abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", 
"ghe", "ghe", "cdz", "cdz"), year = c(2015, 2016, 2015, 2016, 
2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)), class = "data.frame", row.names = c(NA, 
-14L))