How do I use "union" across more than two columns in R?


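I am trying to compare columns in a data.frame to identify the unique values across multiple columns. I have data for several individuals, which I have converted from long to wide format, and I want to identify all of the values for each individual.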

df <- tibble(ID=rep(1:10,3), Value = as.character(rep(list("A1,B7,B11,C2","A1,B5,B8,D3",""),10)))

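This gives an example of what a subset of the data might look like (probably not the neatest way to build it, but it seems to work).

This is what I have so far: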

df %>%
  mutate(across(Value, ~na_if(., ""))) %>%
  within({time <- ave(ID, list=ID, FUN=seq_along)}) %>% 
  pivot_wider(id_cols="ID",names_from="time",names_prefix="Value",values_from="Value") %>%
  rowwise() %>%
  mutate(Value = union(
    str_split(Value1, ",", simplify = TRUE),
    str_split(Value2, ",", simplify = TRUE)
  ) |> str_c(collapse = ", "))

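This works well for the first two columns generated, but I'm not sure how best to incorporate the third column (or, more generally, more columns, as there are usually 5 or more).

I had considered a loop that compares each new column with the output of the previous iteration, but thought there might be a simpler way.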

r string compare
2 Answers

0 votes

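If I understand what you are trying to do, it is to collect all of the unique values in each individual's row across the Value1-Value3 variables. If so, you can first split them on commas, then find all the unique values in each row, and then paste them back together.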
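
On a single row, the three steps described above (split, unique, paste) can be sketched in base R as follows. The `cells` vector is a made-up stand-in for one individual's Value1-Value3 cells, with NA for the empty one; it is only an illustration, not part of the original data.

```r
# Hypothetical cells from one wide row (Value1, Value2, Value3)
cells <- c("A1,B7,B11,C2", "A1,B5,B8,D3", NA)

# 1. Split every cell on commas and flatten into one character vector
pieces <- unlist(strsplit(cells, ",", fixed = TRUE))

# 2. Drop the NA that comes from the empty cell, keep first occurrences only
uniq <- unique(pieces[!is.na(pieces)])

# 3. Paste the unique values back into a single string
combined <- paste(uniq, collapse = ",")
print(combined)
# "A1,B7,B11,C2,B5,B8,D3"
```

The full pipeline below applies the same idea to every row at once.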

library(dplyr)
library(tidyr)
library(stringr)
df <- tibble(ID=rep(1:10,3), Value = as.character(rep(list("A1,B7,B11,C2","A1,B5,B8,D3",""),10)))

df <- df %>%
  mutate(across(Value, ~na_if(., ""))) %>%
  within({time <- ave(ID, list=ID, FUN=seq_along)}) %>% 
  pivot_wider(id_cols="ID",names_from="time",names_prefix="Value",values_from="Value") %>%
  rowwise() %>%
  mutate(Value = union(
    str_split(Value1, ",", simplify = TRUE),
    str_split(Value2, ",", simplify = TRUE)
  ) |> str_c(collapse = ", "))


df %>% 
  select(-Value) %>% 
  rowwise() %>% 
  mutate(across(starts_with("Value"), ~list(str_split(.x, ",", simplify=TRUE)[1,]))) %>% 
  mutate(Value = list(na.omit(unique(c(unlist(pick(starts_with("Value")))))))) %>%
  mutate(Value = paste(Value, collapse=","))
#> # A tibble: 10 × 5
#> # Rowwise: 
#>       ID Value1    Value2    Value3    Value                
#>    <int> <list>    <list>    <list>    <chr>                
#>  1     1 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#>  2     2 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#>  3     3 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#>  4     4 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#>  5     5 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#>  6     6 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#>  7     7 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#>  8     8 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#>  9     9 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#> 10    10 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3

Created on 2024-01-23 with reprex v2.0.2
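
For an arbitrary number of Value columns, the per-row union can also be written as a fold with Reduce(union, ...), which is essentially the "compare each new column with the previous result" loop from the question in functional form. The following is a minimal base R sketch under stated assumptions: the `wide` table, the `row_union` helper, and the column-name pattern are all illustrative and not taken from the original post.

```r
# Hypothetical wide data with three value columns; NA marks an empty cell
wide <- data.frame(
  ID     = 1:2,
  Value1 = c("A1,B7,B11,C2", "A1,B5,B8,D3"),
  Value2 = c("A1,B5,B8,D3", NA),
  Value3 = c(NA, "A1,B7,B11,C2"),
  stringsAsFactors = FALSE
)

# Pick up Value1, Value2, ... however many there are
value_cols <- grep("^Value[0-9]+$", names(wide), value = TRUE)

# Union of the comma-separated values held in one row's cells
row_union <- function(cells) {
  parts <- strsplit(cells[!is.na(cells)], ",", fixed = TRUE)
  # Fold union() over the list, starting from an empty character vector
  paste(Reduce(union, parts, character(0)), collapse = ",")
}

wide$Value <- apply(wide[value_cols], 1, row_union)
print(wide[c("ID", "Value")])
```

Because union() only takes two arguments, Reduce() is what lets it scale to 5 or more columns without naming each one.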


0 votes

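Not sure if this is what you want to do, but this does identify the unique values across multiple rows within a group.

Starting from the original df, first use strsplit to get the individual values from the strings, then find the unique values, and finally paste them together to get the result.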

library(dplyr)

df %>% 
  reframe(Value = paste(unique(unlist(strsplit(Value, ","))), collapse=", "), .by = ID)
# A tibble: 10 × 2
      ID Value                      
   <int> <chr>                      
 1     1 A1, B7, B11, C2, B5, B8, D3
 2     2 A1, B5, B8, D3, B7, B11, C2
 3     3 A1, B7, B11, C2, B5, B8, D3
 4     4 A1, B7, B11, C2, B5, B8, D3
 5     5 A1, B5, B8, D3, B7, B11, C2
 6     6 A1, B7, B11, C2, B5, B8, D3
 7     7 A1, B7, B11, C2, B5, B8, D3
 8     8 A1, B5, B8, D3, B7, B11, C2
 9     9 A1, B7, B11, C2, B5, B8, D3
10    10 A1, B7, B11, C2, B5, B8, D3
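
If dplyr is not available, roughly the same grouped computation can be sketched in base R with split() and vapply(). The `long` sample and the `collapse_unique` helper below are hypothetical and much smaller than the data in this answer; they only illustrate the shape of the approach.

```r
# Small hypothetical long-format sample: three rows per ID, "" for empty
long <- data.frame(
  ID    = rep(1:2, 3),
  Value = c("A1,B7,B11,C2", "A1,B5,B8,D3", "",
            "A1,B7,B11,C2", "A1,B5,B8,D3", ""),
  stringsAsFactors = FALSE
)

# Split the strings, keep unique values, and paste them back together
collapse_unique <- function(x) {
  paste(unique(unlist(strsplit(x, ",", fixed = TRUE))), collapse = ", ")
}

# One collapsed string per ID
by_id <- vapply(split(long$Value, long$ID), collapse_unique, character(1))
result <- data.frame(
  ID    = as.integer(names(by_id)),
  Value = unname(by_id),
  stringsAsFactors = FALSE
)
print(result)
```

Empty strings need no special handling here because strsplit("", ",") yields a zero-length vector, which simply disappears in unlist().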

Data

df <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L), Value = c("A1,B7,B11,C2", "A1,B5,B8,D3", 
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3", 
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3", 
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3", 
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3", 
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -30L))