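I am trying to write code in R that uses a cross join, similar to what we would do in PROC SQL.
However, I cannot get the code to work. Please help.
The data is as follows: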
data.frame(
all_names = c("PARENT1", "CHILD1", "CHILD2", "CHILD3", "PARENT2", "PARENT3", "CHILD4", "CHILD5", "CHILD6", "PARENT7", "CHILD7", "CHILD8")
)
all_names
1 PARENT1
2 CHILD1
3 CHILD2
4 CHILD3
5 PARENT2
6 PARENT3
7 CHILD4
8 CHILD5
9 CHILD6
10 PARENT7
11 CHILD7
12 CHILD8
Expected output:
data.frame(
parent = c("PARENT1", "PARENT1", "PARENT1", "PARENT2", "PARENT2", "PARENT2", "PARENT3", "PARENT3", "PARENT3", "PARENT4", "PARENT4"),
child = c("CHILD1", "CHILD2", "CHILD3", "CHILD4", "CHILD5", "CHILD6", "CHILD4", "CHILD5", "CHILD6", "CHILD7", "CHILD8")
)
parent child
1 PARENT1 CHILD1
2 PARENT1 CHILD2
3 PARENT1 CHILD3
4 PARENT2 CHILD4
5 PARENT2 CHILD5
6 PARENT2 CHILD6
7 PARENT3 CHILD4
8 PARENT3 CHILD5
9 PARENT3 CHILD6
10 PARENT4 CHILD7
11 PARENT4 CHILD8
I tried the following, but I got stuck and could not get any further. I am trying to do a cross join:
data %>% mutate(seq=row_number())
child <- data %>% filter(stringr::str_detect(all_names,'CHILD'))
parent <- data %>% filter(stringr::str_detect(all_names,'PARENT'))
parent %>% cross_join(child) %>% filter(seq.x <= seq.y)
In the example, the output contains PARENT4, but there is no PARENT4 in the input. I assume that PARENT4 in the output should be PARENT7.
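The key here is to use consecutive_id to form runs of consecutive parent rows and consecutive child rows, and then join each run of parents to the run of children that immediately follows it.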
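library(dplyr)

# dat is the question's data frame with the single column all_names
tmp <- dat %>%
  mutate(row = row_number(),
         # run id that increments whenever the PARENT / non-PARENT status flips
         runs = consecutive_id(grepl("PARENT", all_names)),
         # odd runs are parents (the data starts with a parent), even runs are children
         p = runs %% 2)

# parent rows keep their run id
p <- tmp %>% filter(p == 1)
# child rows take the run id of the parent run immediately before them
ch <- tmp %>% filter(p == 0) %>% mutate(runs = runs - 1)

# pair every parent in a run with every child in the run that follows
p %>%
  inner_join(ch, "runs", relationship = "many-to-many") %>%
  select(parent = all_names.x, child = all_names.y)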
giving
parent child
1 PARENT1 CHILD1
2 PARENT1 CHILD2
3 PARENT1 CHILD3
4 PARENT2 CHILD4
5 PARENT2 CHILD5
6 PARENT2 CHILD6
7 PARENT3 CHILD4
8 PARENT3 CHILD5
9 PARENT3 CHILD6
10 PARENT7 CHILD7
11 PARENT7 CHILD8
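To see why the run ids line up, here is a minimal sketch on a smaller toy input. The data frame toy and the helper column is_parent are made up for illustration and are not part of the question's data. It assumes dplyr 1.1.0 or later, where consecutive_id is available.

```r
library(dplyr)

# Hypothetical toy input with the same shape as the question's data
toy <- data.frame(
  all_names = c("PARENT1", "CHILD1", "CHILD2", "PARENT2", "PARENT3", "CHILD3")
)

runs_demo <- toy %>%
  mutate(
    # TRUE for parent rows, FALSE for child rows
    is_parent = grepl("PARENT", all_names),
    # a new id starts every time is_parent changes value
    runs = consecutive_id(is_parent),
    # odd run ids mark parent runs because the first row is a parent
    p = runs %% 2
  )

print(runs_demo)
# runs is 1 2 2 3 3 4, so PARENT2 and PARENT3 share run 3
# and the child run 4 maps back to it after subtracting 1
```

Note that the odd/even trick relies on the data starting with a parent row. If the first row could be a child, the parity would flip and the join key would need adjusting.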