Suppose I have a data frame like this (a simplified, analogous version of my problem):
ID <- c(1,2,3)
value <- c("1+4-3", "2+7-6+4-3", "-1+3")
df <- data.frame(ID, value)
ID value
1 1+4-3
2 2+7-6+4-3
3 -1+3
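I need to split the value column into multiple columns on multiple delimiters (+ and -), while keeping the delimiters in separate columns of their own.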
The resulting data frame should look like this:
ID x1 x2 x3 x4 x5 x6 x7 x8 x9
1 1 + 4 - 3 <NA> <NA> <NA> <NA>
2 2 + 7 - 6 + 4 - 3
3 - 1 + 3 <NA> <NA> <NA> <NA> <NA>
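Also, I don't know in advance how many result columns will be needed (it might be 50 rather than the 9 in this example).
What is the best way to achieve this?
Thanks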
If your numbers consist only of digits, you could try
library(tidyverse)  # dplyr (>= 1.1.0 for .by), stringr, tidyr
df %>%
mutate(value = str_extract_all(value, "\\d+|\\D")) %>%
unnest(value) %>%
mutate(name = seq_len(n()), .by = ID) %>%
pivot_wider(names_prefix = "X")
This gives
# A tibble: 2 × 10
ID X1 X2 X3 X4 X5 X6 X7 X8 X9
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 + 4 - 3 NA NA NA NA
2 2 2 + 7 - 6 + 4 - 3
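To make the "only digits" caveat concrete, here is a minimal base R sketch of what the pattern \d+|\D picks out of a single string. The helper name tokenise is made up for this illustration and is not part of the answer above.

```r
# Hypothetical helper: extract runs of digits, or any single non-digit character.
# perl = TRUE is used so that \d and \D are available.
tokenise <- function(x) regmatches(x, gregexpr("\\d+|\\D", x, perl = TRUE))[[1]]

tokenise("25+110-3")  # multi-digit integers stay whole
tokenise("1.5+2")     # the decimal point becomes a token of its own
```

The second call returns "1", ".", "5", "+", "2", so values with decimals would need a different pattern.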
You could do this:
library(tidyverse)
df |>
separate_longer_delim(cols = value, delim = regex("(?=\\+|-)")) |>
separate_longer_position(cols = value, width = 1) |>
mutate(pos = row_number(), .by = ID) |>
pivot_wider(values_from = value,
names_from = "pos",
names_prefix = "X")
# A tibble: 3 × 10
ID X1 X2 X3 X4 X5 X6 X7 X8 X9
<int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 + 4 - 3 NA NA NA NA
2 2 2 + 7 - 6 + 4 - 3
3 3 - 1 + 3 NA NA NA NA NA
My approach:
ID <- c(1,2,3)
value <- c("1+4-3","2+7-6+4-3","25+110/2*214")
# added example 3 to show effect on numbers with >1 digit
df <- data.frame(ID,value)
df |> dplyr::mutate(
X = lapply(value, \(x) {
# split by word/nonword boundaries
y <- stringr::str_split(x, pattern = "\\b", simplify = TRUE)
# drop the empty first and last strings
y[nzchar(y)]
})) |> tidyr::unnest_wider(X, names_sep = "")
which gives
# A tibble: 3 × 11
ID value X1 X2 X3 X4 X5 X6 X7 X8 X9
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1+4-3 1 + 4 - 3 NA NA NA NA
2 2 2+7-6+4-3 2 + 7 - 6 + 4 - 3
3 3 25+110/2*214 25 + 110 / 2 * 214 NA NA
If you drop the final pipe into unnest_wider, you get this instead, which IMO may be tidier in some ways:
ID value X
1 1 1+4-3 1, +, 4, -, 3
2 2 2+7-6+4-3 2, +, 7, -, 6, +, 4, -, 3
3 3 25+110/2*214 25, +, 110, /, 2, *, 214
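If you keep the tokens in a list column like that, you can still query them without widening. A rough base R sketch follows. Note that it swaps in regmatches() with a digits/non-digits pattern instead of the str_split() call above, and the column names n_tokens and third are just illustrative.

```r
df <- data.frame(ID = c(1, 2, 3),
                 value = c("1+4-3", "2+7-6+4-3", "25+110/2*214"))

# One character vector of tokens per row, stored as a list column
df$X <- regmatches(df$value, gregexpr("[0-9]+|[^0-9]", df$value))

# How many tokens each row has, and the third token of each row
df$n_tokens <- lengths(df$X)
df$third <- vapply(df$X, function(x) x[3], character(1))
```

Here n_tokens comes out as 5, 9, 7 and third as "4", "7", "110".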
You can use separate_wider_delim() from tidyr:
library(tidyverse)
df %>%
rename(x = value) %>%
separate_wider_delim(x,
delim = stringr::regex("(?<=\\d)(?=\\D)|(?<=\\D)(?=\\d)"),
too_few = "align_start",
names_sep = '')
# A tibble: 3 × 10
ID x1 x2 x3 x4 x5 x6 x7 x8 x9
<dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 1 + 4 - 3 NA NA NA NA
2 2 2 + 7 - 6 + 4 - 3
3 3 - 1 + 3 NA NA NA NA NA
Using base R only:
# Example data
ID <- c(1,2)
value <- c("1+4-3","2+7-6+4-3")
df <- data.frame(ID,value)
# max nchar
mrow <- max(nchar(df$value))
# split into single characters (so this assumes single-digit numbers) and arrange
df2 <- strsplit(df$value, split='')
df2 <- sapply(df2, function(x) c(x,rep(NA,mrow-length(x)))) # similar to fill=T in read.table
data.frame(ID=df$ID,t(df2))
ID X1 X2 X3 X4 X5 X6 X7 X8 X9
1 1 1 + 4 - 3 <NA> <NA> <NA> <NA>
2 2 2 + 7 - 6 + 4 - 3
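Since splitting on '' cuts every character apart, a number like 30 would end up in two columns. A possible base R variant that tokenises first and then pads is sketched below. The third row "-1+30" and the x1, x2, ... column names are my own choices for the illustration.

```r
df <- data.frame(ID = c(1, 2, 3),
                 value = c("1+4-3", "2+7-6+4-3", "-1+30"),
                 stringsAsFactors = FALSE)

# Tokens per row: runs of digits, or any single non-digit character
tokens <- regmatches(df$value, gregexpr("[0-9]+|[^0-9]", df$value))

# Pad every row with NA up to the longest token count, then bind as rows
width <- max(lengths(tokens))
padded <- do.call(rbind, lapply(tokens, function(x) {
  c(x, rep(NA_character_, width - length(x)))
}))
colnames(padded) <- paste0("x", seq_len(width))

result <- data.frame(ID = df$ID, padded, stringsAsFactors = FALSE)
```

The number of columns is taken from the data, so it does not matter whether the longest row needs 9 columns or 50.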