How do I split a string in a specific sequence?

Question  Votes: 0  Answers: 2

Suppose I have a string in R, say "ABCDEFG". I can split it into chunks of two characters with the following regular expression.

 strsplit("ABCDEFG", "(?<=(..))", perl = TRUE)
[[1]]
[1] "AB" "CD" "EF" "G" 

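But I want to split it in a specific sequence: the first two characters, then the next one character, then two characters, then one character, and so on.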

If my input string is "ABCDEFG", I want "AB" "C" "DE" "F" "G" as output (the last element holds whatever single character is left over).

How can I do this? I don't want to count nchar in advance, because I want this to be computed dynamically.

r regex split
2 Answers

2 votes

You can use a vectorized substr function:

vsubstr <- Vectorize(substr)

x <- "ABCDEFG"

pat <- rep(c(1,2), length.out=1 + ceiling(nchar(x)/2))
start <- cumsum(pat)
stop <- start + rep(c(1,0), length.out=1 + ceiling(nchar(x)/2))

vsubstr(x, start, stop)

ABCDEFG    <NA>    <NA>    <NA>    <NA> 
   "AB"     "C"    "DE"     "F"     "G"


x <- "ABCDEFGH"
vsubstr(x, start, stop)

ABCDEFGH     <NA>     <NA>     <NA>     <NA> 
    "AB"      "C"     "DE"      "F"     "GH"

Not very elegant, I admit. You can hide all the ugly code in a function.

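Two_one <- function(x) {
  vsubstr <- Vectorize(substr)
  # One chunk per character is a safe upper bound; 1 + ceiling(nchar(x)/2)
  # chunks stop covering the whole string from 10 characters on.
  n <- max(nchar(x), 1)
  pat <- rep(c(1,2), length.out=n)
  start <- cumsum(pat)
  stop <- start + rep(c(1,0), length.out=n)
  out <- vsubstr(x, start, stop)
  out[nzchar(out)]  # drop the empty chunks that start past the end of x
}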

x <- "ABCDEFG"

Two_one(x)
ABCDEFG    <NA>    <NA>    <NA>    <NA> 
   "AB"     "C"    "DE"     "F"     "G"

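The start/stop arithmetic above can also be sketched with base R's substring(), which is already vectorized over its first and last arguments, so no Vectorize() wrapper is needed. This is only a minimal sketch, not the answerer's code: the function name split_pattern and its widths argument are hypothetical, and it assumes x is a single string and every width is a positive integer.

```r
# Hypothetical helper: split one string into chunks whose widths cycle
# through `widths` (2, 1, 2, 1, ... by default).
split_pattern <- function(x, widths = c(2, 1)) {
  n <- nchar(x)
  if (n == 0) return(character(0))
  # One chunk per character is always enough, since every width is >= 1.
  w <- rep(widths, length.out = n)
  last <- cumsum(w)          # end position of each chunk
  first <- last - w + 1      # start position of each chunk
  keep <- first <= n         # discard chunks that start past the end
  substring(x, first[keep], pmin(last[keep], n))
}

print(split_pattern("ABCDEFG"))
# [1] "AB" "C"  "DE" "F"  "G"
```

Because the number of chunks is bounded by nchar(x) rather than estimated, the same sketch should cover longer inputs, and other cycles such as widths = c(1, 4) can be passed without changing the body.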
0 votes

Here is an idea.

> spl_pat <- \(x, p) {
+   stopifnot(all(is.na(p) | p >= 0))
+   if (identical(p, 0)) return('')
+   if (any(is.na(p))) return(x)  ## compatibility w/ strsplit()
+   if (identical(p, NULL)) p <- 1  ## compatibility w/ strsplit()
+   s <- strsplit(x, '')
+   lapply(s, \(x) {
+     xl <- length(x)
+     pl <- length(p)
+     u <- rep(p, max(xl/pl, 1))
+     o <- vapply(
+       split(x, cut(seq_along(x), c(0, cumsum(u), Inf))), 
+       paste, collapse='', FUN.VALUE=character(1))
+     unname(o[nzchar(o)])
+   })
+ }
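To see what the inner cut()/split() step does, here is a minimal walk-through for a single string with the pattern 2:1. It only replays the same base R calls outside the function; the variable names (chars, widths, breaks, groups, pieces, result) are chosen for illustration, and the repeat count of 3 is hard-coded for this 7-character input.

```r
chars  <- strsplit("ABCDEFG", "")[[1]]   # "A" "B" "C" "D" "E" "F" "G"
widths <- rep(2:1, 3)                    # 2 1 2 1 2 1
breaks <- c(0, cumsum(widths), Inf)      # 0 2 3 5 6 8 9 Inf

# cut() assigns each character position to the interval (break, next break]
# it falls into, so positions 1-2 share a group, position 3 is alone, etc.
groups <- cut(seq_along(chars), breaks)

# split() gathers the characters of each group; paste() glues them together.
# Intervals that contain no position produce "" and are dropped afterwards.
pieces <- vapply(split(chars, groups), paste, collapse = "",
                 FUN.VALUE = character(1))
result <- unname(pieces[nzchar(pieces)])
print(result)
# [1] "AB" "C"  "DE" "F"  "G"
```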

Usage

Single string (vector length == 1)

> spl_pat('ABCDEFG', 2:1)
[[1]]
[1] "AB" "C"  "DE" "F"  "G" 

> spl_pat('ABCDEFG', c(1, 4))
[[1]]
[1] "A"    "BCDE" "F"    "G"   

> spl_pat('ABCDEFG', 2)
[[1]]
[1] "AB" "CD" "EF" "G" 

> spl_pat('ABCDEFG', 1)
[[1]]
[1] "A" "B" "C" "D" "E" "F" "G"

> spl_pat('ABCDEFG', 0)
[1] ""
> spl_pat('ABCDEFG', NA)
[1] "ABCDEFG"
> spl_pat('ABCDEFG', NULL)
[[1]]
[1] "A" "B" "C" "D" "E" "F" "G"

Vector length > 1

> spl_pat(c('ABCDEFG', 'ABCDEFGHIJ'), 2:1)
[[1]]
[1] "AB" "C"  "DE" "F"  "G" 

[[2]]
[1] "AB" "C"  "DE" "F"  "GH" "I"  "J" 

> spl_pat(c('ABCDEFG', 'ABCDEFGHIJ'), 1:7)
[[1]]
[1] "A"   "BC"  "DEF" "G"  

[[2]]
[1] "A"    "BC"   "DEF"  "GHIJ"

> spl_pat('ABCDEFG', 1:1e3)
[[1]]
[1] "A"   "BC"  "DEF" "G"  

> Vectorize(spl_pat)(c('ABCDEFG', 'ABCDEFGHIJ'), list(2:1, 1:2))
$ABCDEFG
[1] "AB" "C"  "DE" "F"  "G" 

$ABCDEFGHIJ
[1] "A"  "BC" "D"  "EF" "G"  "HI" "J" 

Different patterns:

> Vectorize(spl_pat)(c('ABCDEFG', 'ABCDEFGHIJ', 'ABCDEFGHIJ'), list(2:1, 1:2, 0))
$ABCDEFG
[1] "AB" "C"  "DE" "F"  "G" 

$ABCDEFGHIJ
[1] "A"  "BC" "D"  "EF" "G"  "HI" "J" 

$ABCDEFGHIJ
[1] ""

A pattern < 0 probably wouldn't make sense, would it?

> spl_pat('ABCDEFG', -1)
Error in spl_pat("ABCDEFG", -1) : all(is.na(p) | p >= 0) is not TRUE