Suppose I have a string in R, say "ABCDEFG". I can split it into chunks of two characters each with the following regular expression:
strsplit("ABCDEFG", "(?<=(..))", perl = TRUE)
[[1]]
[1] "AB" "CD" "EF" "G"
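But I want to split it following a specific sequence: the first two characters, then the next single character, then two characters, then one character, and so on.
If my input string is "ABCDEFG", I want "AB" "C" "DE" "F" "G" as output (the last element contains only the one character that is left over).
How can I do this? I don't want to count the characters with nchar beforehand, because I want the split to be computed dynamically.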
You can use a vectorized substr function:
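vsubstr <- Vectorize(substr)
x <- "ABCDEFG"
n <- nchar(x)
k <- 2 * (n %/% 3) + (n %% 3 > 0)  # number of chunks: two per full group of 3, plus one for any remainder
pat <- rep(c(1,2), length.out=k)
start <- cumsum(pat)  # chunk start positions: 1, 3, 4, 6, 7, ...
stop <- start + rep(c(1,0), length.out=k)  # chunk end positions: 2, 3, 5, 6, 8, ...
vsubstr(x, start, stop)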
ABCDEFG    <NA>    <NA>    <NA>    <NA>
   "AB"     "C"    "DE"     "F"     "G"
x <- "ABCDEFGH"
vsubstr(x, start, stop)
ABCDEFGH     <NA>     <NA>     <NA>     <NA>
    "AB"      "C"     "DE"      "F"     "GH"
Not very elegant, I admit. You can hide all the ugly code inside a function.
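Two_one <- function(x) {
  vsubstr <- Vectorize(substr)
  n <- nchar(x)
  k <- 2 * (n %/% 3) + (n %% 3 > 0)  # number of chunks: two per full group of 3, plus one for any remainder
  pat <- rep(c(1,2), length.out=k)
  start <- cumsum(pat)  # chunk start positions
  stop <- start + rep(c(1,0), length.out=k)  # chunk end positions
  vsubstr(x, start, stop)
}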
x <- "ABCDEFG"
Two_one(x)
ABCDEFG    <NA>    <NA>    <NA>    <NA>
   "AB"     "C"    "DE"     "F"     "G"
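As a possible variation on the approach above (not part of the original answer), base R's substring() already recycles its first and last arguments, so the Vectorize() wrapper may not be needed. The sketch below assumes a single input string; the function name two_one_substring is hypothetical and only used for illustration.

```r
# Hypothetical helper: split one string into chunks of 2, 1, 2, 1, ... characters.
two_one_substring <- function(x) {
  n <- nchar(x)
  if (n == 0) return(character(0))
  # n chunks always suffice, because every chunk holds at least one character
  last <- cumsum(rep_len(c(2, 1), n))          # candidate end positions: 2, 3, 5, 6, 8, ...
  last <- last[seq_len(which(last >= n)[1])]   # keep chunks up to the one that reaches the end
  first <- c(1, head(last, -1) + 1)            # each chunk starts right after the previous one
  substring(x, first, last)                    # an end position past nchar(x) is simply truncated
}

two_one_substring("ABCDEFG")
two_one_substring("ABCDEFGHIJ")
```

Because the end positions are derived from a cumulative sum, swapping c(2, 1) for another vector of positive chunk lengths should work the same way.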
Here is an idea.
> spl_pat <- \(x, p) {
+   stopifnot(all(is.na(p) | p >= 0))
+   if (identical(p, 0)) return('')
+   if (any(is.na(p))) return(x)  ## compatibility w/ strsplit()
+   if (identical(p, NULL)) p <- 1  ## compatibility w/ strsplit()
+   s <- strsplit(x, '')
+   lapply(s, \(x) {
+     xl <- length(x)
+     pl <- length(p)
+     u <- rep(p, max(ceiling(xl/pl), 1))  ## round up so the repeated pattern covers all of x
+     o <- vapply(
+       split(x, cut(seq_along(x), c(0, cumsum(u), Inf))),
+       paste, collapse='', FUN.VALUE=character(1))
+     unname(o[nzchar(o)])
+   })
+ }
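The core of a function like spl_pat is grouping character positions with cut() and then pasting each group back together. A minimal sketch of just that step, assuming the fixed pattern 2, 1 and a single string (the variable names chars, breaks, groups and pieces are illustrative only):

```r
# Split the string into single characters.
chars <- strsplit("ABCDEFG", "")[[1]]

# Cumulative chunk lengths become interval boundaries: 0 2 3 5 6 8.
breaks <- c(0, cumsum(c(2, 1, 2, 1, 2)))

# cut() assigns each character position to the interval (a, b] it falls in.
groups <- cut(seq_along(chars), breaks)

# Paste the characters of each group back into one string.
pieces <- unname(vapply(split(chars, groups), paste, character(1), collapse = ""))
pieces
```

The boundaries must be strictly increasing, since cut() rejects duplicated breaks; this is one reason a zero inside a longer pattern vector needs separate handling.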
Single string (vector length == 1)
> spl_pat('ABCDEFG', 2:1)
[[1]]
[1] "AB" "C" "DE" "F" "G"
> spl_pat('ABCDEFG', c(1, 4))
[[1]]
[1] "A" "BCDE" "F" "G"
> spl_pat('ABCDEFG', 2)
[[1]]
[1] "AB" "CD" "EF" "G"
> spl_pat('ABCDEFG', 1)
[[1]]
[1] "A" "B" "C" "D" "E" "F" "G"
> spl_pat('ABCDEFG', 0)
[1] ""
> spl_pat('ABCDEFG', NA)
[1] "ABCDEFG"
> spl_pat('ABCDEFG', NULL)
[[1]]
[1] "A" "B" "C" "D" "E" "F" "G"
Vector length > 1
> spl_pat(c('ABCDEFG', 'ABCDEFGHIJ'), 2:1)
[[1]]
[1] "AB" "C" "DE" "F" "G"
[[2]]
[1] "AB" "C" "DE" "F" "GH" "I" "J"
> spl_pat(c('ABCDEFG', 'ABCDEFGHIJ'), 1:7)
[[1]]
[1] "A" "BC" "DEF" "G"
[[2]]
[1] "A" "BC" "DEF" "GHIJ"
> spl_pat('ABCDEFG', 1:1e3)
[[1]]
[1] "A" "BC" "DEF" "G"
> Vectorize(spl_pat)(c('ABCDEFG', 'ABCDEFGHIJ'), list(2:1, 1:2))
$ABCDEFG
[1] "AB" "C" "DE" "F" "G"
$ABCDEFGHIJ
[1] "A" "BC" "D" "EF" "G" "HI" "J"
Different patterns:
> Vectorize(spl_pat)(c('ABCDEFG', 'ABCDEFGHIJ', 'ABCDEFGHIJ'), list(2:1, 1:2, 0))
$ABCDEFG
[1] "AB" "C" "DE" "F" "G"
$ABCDEFGHIJ
[1] "A" "BC" "D" "EF" "G" "HI" "J"
$ABCDEFGHIJ
[1] ""
A pattern value < 0 probably wouldn't make sense, would it? It raises an error:
> spl_pat('ABCDEFG', -1)
Error in spl_pat("ABCDEFG", -1) : all(is.na(p) | p >= 0) is not TRUE