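I am trying to compare columns in a data.frame to identify the unique values across multiple columns. I have data for multiple individuals that I have converted from long to wide format, and I want to identify all the values for each individual.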
df <- tibble(ID=rep(1:10,3), Value = as.character(rep(list("A1,B7,B11,C2","A1,B5,B8,D3",""),10)))
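This gives an example of what a subset of the data might look like (probably not the neatest way of building it, but it seems to work).
So far I have: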
df %>%
  mutate(across(Value, ~na_if(., ""))) %>%
  within({time <- ave(ID, list=ID, FUN=seq_along)}) %>%
  pivot_wider(id_cols="ID",names_from="time",names_prefix="Value",values_from="Value") %>%
  rowwise() %>%
  mutate(Value = union(
    str_split(Value1, ",", simplify = TRUE),
    str_split(Value2, ",", simplify = TRUE)
  ) |> str_c(collapse = ", "))
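This works well for the first two columns produced, but I am not sure how best to incorporate the third column (or more columns in general, since there are usually 5 or more).
I had considered a loop that compares each new column with the output of the previous iteration, but thought there might be a simpler way.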
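If I understand what you are trying to do, it is to collect all the unique values in each individual's row across the Value1-Value3 variables. If that is the case, you can first split them on the comma, then find all the unique values in each row, and then paste them back together.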
library(dplyr)
library(tidyr)
library(stringr)
df <- tibble(ID=rep(1:10,3), Value = as.character(rep(list("A1,B7,B11,C2","A1,B5,B8,D3",""),10)))
# Reshape to wide format: one row per ID, one ValueN column per repeat.
df <- df %>%
  mutate(across(Value, ~na_if(., ""))) %>%
  within({time <- ave(ID, list=ID, FUN=seq_along)}) %>%
  pivot_wider(id_cols="ID",names_from="time",names_prefix="Value",values_from="Value") %>%
  rowwise() %>%
  mutate(Value = union(
    str_split(Value1, ",", simplify = TRUE),
    str_split(Value2, ",", simplify = TRUE)
  ) |> str_c(collapse = ", "))
# Drop the two-column result, split every ValueN column into a list-column,
# then keep the unique non-missing values per row and paste them back together.
df %>%
  select(-Value) %>%
  rowwise() %>%
  mutate(across(starts_with("Value"), ~list(str_split(.x, ",", simplify=TRUE)[1,]))) %>%
  mutate(Value = list(na.omit(unique(c(unlist(pick(starts_with("Value")))))))) %>%
  mutate(Value = paste(Value, collapse=","))
#> # A tibble: 10 × 5
#> # Rowwise:
#> ID Value1 Value2 Value3 Value
#> <int> <list> <list> <list> <chr>
#> 1 1 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#> 2 2 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#> 3 3 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#> 4 4 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#> 5 5 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#> 6 6 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#> 7 7 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
#> 8 8 <chr [4]> <chr [1]> <chr [4]> A1,B5,B8,D3,B7,B11,C2
#> 9 9 <chr [1]> <chr [4]> <chr [4]> A1,B7,B11,C2,B5,B8,D3
#> 10 10 <chr [4]> <chr [4]> <chr [1]> A1,B7,B11,C2,B5,B8,D3
Created on 2024-01-23 with reprex v2.0.2
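For what it's worth, the loop idea from the question (compare each new column with the result so far) can be written as a fold with base R's Reduce() and union(). The following is only a rough sketch under some assumptions: `union_tokens` and the small `wide` table are made up for illustration and stand in for the pivot_wider() result, and NA and empty strings are assumed to mean "no values".

```r
# Hypothetical helper (not from the posts above): fold union() over the
# comma-separated tokens of every string in x, skipping NA and "".
union_tokens <- function(x) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)  # one vector per input string
  parts <- lapply(parts, function(p) p[!is.na(p) & nzchar(p)])
  # Reduce() plays the role of the loop: running result vs. next column.
  paste(Reduce(union, parts, character(0)), collapse = ",")
}

# Small wide table shaped like the pivot_wider() output (illustrative data).
wide <- data.frame(
  ID = 1:2,
  Value1 = c("A1,B7,B11,C2", "A1,B5,B8,D3"),
  Value2 = c("A1,B5,B8,D3", NA),
  Value3 = c(NA, "A1,B7,B11,C2"),
  stringsAsFactors = FALSE
)

# Pick up however many ValueN columns exist, so 5 or more work the same way.
value_cols <- grep("^Value[0-9]+$", names(wide), value = TRUE)

wide$Value <- vapply(
  seq_len(nrow(wide)),
  function(i) union_tokens(unlist(wide[i, value_cols])),
  character(1)
)
print(wide[c("ID", "Value")])
```

Because the columns are selected by pattern rather than by name, nothing needs to change when the number of ValueN columns grows.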
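Not sure whether this is what you want to do, but it does identify the unique values across multiple rows within a group.
Starting from the original df, first use strsplit to get the individual values from the strings, then find the unique values, and finally paste them together to get the result.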
library(dplyr)
df %>%
  reframe(Value = paste(unique(unlist(strsplit(Value, ","))), collapse=", "), .by = ID)
# A tibble: 10 × 2
ID Value
<int> <chr>
1 1 A1, B7, B11, C2, B5, B8, D3
2 2 A1, B5, B8, D3, B7, B11, C2
3 3 A1, B7, B11, C2, B5, B8, D3
4 4 A1, B7, B11, C2, B5, B8, D3
5 5 A1, B5, B8, D3, B7, B11, C2
6 6 A1, B7, B11, C2, B5, B8, D3
7 7 A1, B7, B11, C2, B5, B8, D3
8 8 A1, B5, B8, D3, B7, B11, C2
9 9 A1, B7, B11, C2, B5, B8, D3
10 10 A1, B7, B11, C2, B5, B8, D3
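Note that reframe() and the .by argument need dplyr 1.1.0 or later. If that is not available, roughly the same split, unique, paste steps can be done per ID in base R. This is just a sketch: `collapse_unique` is a hypothetical helper name and `long` is a cut-down toy version of the data, not the full 30 rows.

```r
# Toy long-format data (illustrative, two IDs with three rows each).
long <- data.frame(
  ID = c(1L, 2L, 1L, 2L, 1L, 2L),
  Value = c("A1,B7,B11,C2", "A1,B5,B8,D3", "A1,B5,B8,D3", "", "", "A1,B7,B11,C2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on commas, keep first occurrences, re-join.
collapse_unique <- function(x) {
  tokens <- unlist(strsplit(x, ",", fixed = TRUE))  # "" contributes no tokens
  paste(unique(tokens), collapse = ", ")
}

# One row per ID, analogous to the reframe(..., .by = ID) call.
res <- aggregate(Value ~ ID, data = long, FUN = collapse_unique)
print(res)
```

One difference to keep in mind: aggregate() returns the groups sorted by ID, whereas .by keeps them in order of first appearance. For this data the two orders coincide.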
df <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), Value = c("A1,B7,B11,C2", "A1,B5,B8,D3",
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3",
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3",
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3",
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "", "A1,B7,B11,C2", "A1,B5,B8,D3",
"", "A1,B7,B11,C2", "A1,B5,B8,D3", "")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -30L))