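I have data in a text file with multiple columns, and I want to process it without losing any information. Some columns contain two or more pieces of information separated by a special character, for example a "+" (plus sign). I want to put each piece of such a combined entry on its own row within the same column. I have pasted example data below.
My data frame looks like this: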
df <- data.frame(G1=c("GH13_22+CBM4", "GH109+PL7+GH9","GT57", "AA3","",""),
                 G2=c("GH13_22","","GT57+GH15","AA3", "GT41","PL+PL2"),
                 G3=c("GH13", "GH1O9","", "CBM34+GH13+CBM48", "GT41","GH16+CBM4+CBM54+CBM32"))
             G1        G2                    G3
1  GH13_22+CBM4   GH13_22                  GH13
2 GH109+PL7+GH9                           GH1O9
3          GT57 GT57+GH15
4           AA3       AA3      CBM34+GH13+CBM48
5                    GT41                  GT41
6                  PL+PL2 GH16+CBM4+CBM54+CBM32
The expected result should look like this:
df2 <- data.frame(G1=c("GH13_22","CBM4", "GH109","PL7","GH9","GT57", "AA3","","","",""),
                  G2=c("GH13_22","","GT57","GH15","AA3", "GT41","PL","PL2","","",""),
                  G3=c("GH13", "GH1O9","", "CBM34","GH13","CBM48", "GT41","GH16","CBM4","CBM54","CBM32"))
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32
Any help is appreciated. Thanks.
Another option, inspired by @Peter M in this post:
library(tidyverse)  # attaches dplyr and stringr, among others

# Finds the longest vector and pads the other vectors to the same length
makePaddedDataFrame <- function(l) {
  maxlen <- max(lengths(l))
  data.frame(lapply(l, \(x) x[seq_len(maxlen)]))  # indexing past the end pads with NA
}

df %>%
  mutate(across(everything(), \(x) str_split(x, pattern = "\\+"))) %>%  # split each cell on "+"
  lapply(\(x) do.call(c, x)) %>%  # flatten each list column into a single vector
  makePaddedDataFrame() %>%
  replace(is.na(.), "")  # if you want empty strings instead of NA
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4         GH1O9
3    GH109    GT57
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8              PL2  GH16
9                   CBM4
10                 CBM54
11                 CBM32
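The padding step in makePaddedDataFrame relies on a base R behavior: indexing a vector past its end returns NA for the missing positions. A minimal sketch with a made-up vector (x and padded are illustrative names, not part of the answer):

```r
# Hypothetical short vector standing in for one already-split column
x <- c("GT41", "PL", "PL2")

# Asking for positions 1..5 of a length-3 vector yields NA for positions 4 and 5
padded <- x[seq_len(5)]
print(padded)

# Replace the NA padding with empty strings, as in the last step of the pipeline
padded[is.na(padded)] <- ""
print(padded)
```

The same idea applies to every column at once when the target length is the length of the longest column.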
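separate_rows() has been superseded by separate_longer_delim(), because the latter has an API that is more consistent with the other separate functions. Superseded functions will not go away, but will only receive critical bug fixes. See https://tidyr.tidyverse.org/reference/separate_rows.html
na_if() from dplyr turns the empty strings into NA.
With summarise(cur_data()[seq(max(id)), ]), we expand each group to the max of id.
library(dplyr)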
library(tidyr)
df %>%
  pivot_longer(everything()) %>%
  separate_longer_delim(value, "+") %>%
  mutate(value = na_if(value, "")) %>%
  group_by(name) %>%
  mutate(id = row_number()) %>%
  summarise(cur_data()[seq(max(id)), ]) %>%
  pivot_wider(names_from = name, values_from = value)
      id G1      G2      G3
   <int> <chr>   <chr>   <chr>
 1     1 GH13_22 GH13_22 GH13
 2     2 CBM4    NA      GH1O9
 3     3 GH109   GT57    NA
 4     4 PL7     GH15    CBM34
 5     5 GH9     AA3     GH13
 6     6 GT57    GT41    CBM48
 7     7 AA3     PL      GT41
 8     8 NA      PL2     GH16
 9     9 NA      NA      CBM4
10    10 NA      NA      CBM54
11    11 NA      NA      CBM32
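A possible simplification, sketched under the assumption of dplyr >= 1.1.0 and tidyr >= 1.3.0 (in dplyr 1.1.0, cur_data() is deprecated in favour of pick(), and mutate() accepts a .by argument). By default pivot_wider() fills cells that have no matching value with NA, so numbering the values within each original column may be enough to get the padded shape. The name res is only for illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(
  G1 = c("GH13_22+CBM4", "GH109+PL7+GH9", "GT57", "AA3", "", ""),
  G2 = c("GH13_22", "", "GT57+GH15", "AA3", "GT41", "PL+PL2"),
  G3 = c("GH13", "GH1O9", "", "CBM34+GH13+CBM48", "GT41", "GH16+CBM4+CBM54+CBM32")
)

res <- df %>%
  pivot_longer(everything()) %>%                  # one row per (column, cell)
  separate_longer_delim(value, delim = "+") %>%   # one row per "+"-separated piece
  mutate(value = na_if(value, "")) %>%            # empty cells become NA
  mutate(id = row_number(), .by = name) %>%       # position of each piece within its column
  pivot_wider(names_from = name, values_from = value)  # missing (id, column) pairs become NA

print(res)
```

Here id plays the role of the output row number, so shorter columns end up with NA at the bottom.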
A base solution:
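# Turn empty strings into NA first: strsplit("", ...) returns character(0), which would drop the cell
parts <- lapply(df, \(x) unlist(strsplit(replace(x, x == '', NA_character_), '\\+')))
# Indexing past the end of a vector pads it with NA up to the length of the longest column
as.data.frame(lapply(parts, `[`, 1:max(lengths(parts))))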
        G1      G2    G3
1  GH13_22 GH13_22  GH13
2     CBM4    <NA> GH1O9
3    GH109    GT57  <NA>
4      PL7    GH15 CBM34
5      GH9     AA3  GH13
6     GT57    GT41 CBM48
7      AA3      PL  GT41
8     <NA>     PL2  GH16
9     <NA>    <NA>  CBM4
10    <NA>    <NA> CBM54
11    <NA>    <NA> CBM32
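If the same reshaping is needed for other separators, the base approach above could be wrapped in a small helper. This is only a sketch: split_columns_long, its sep argument, and the demo data frame are hypothetical names introduced here, and the separator is treated as a literal string (fixed = TRUE) rather than a regular expression.

```r
# Hypothetical helper: split every column on a literal separator and
# stack the pieces in rows, padding shorter columns with NA.
split_columns_long <- function(data, sep = "+") {
  parts <- lapply(data, function(col) {
    col <- as.character(col)
    # Empty strings become NA so that strsplit() keeps a placeholder for them
    col[which(col == "")] <- NA_character_
    unlist(strsplit(col, sep, fixed = TRUE))
  })
  n <- max(lengths(parts))
  # Indexing past the end of a vector pads it with NA
  as.data.frame(lapply(parts, `[`, seq_len(n)))
}

# Small made-up example
demo <- data.frame(G1 = c("A+B", ""), G2 = c("C", "D+E+F"))
out <- split_columns_long(demo)
print(out)
```

Using fixed = TRUE avoids having to escape characters such as "+" that are special in regular expressions.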