Efficiently breaking up a string based on the nth occurrence of a substring using R

Question (votes: 1, answers: 2)

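Introduction

Given a string in R, is it possible to get a vectorized solution (i.e. no loops) that breaks the string into blocks, where each block is delimited by the nth occurrence of a substring in the string?

Work done so far, with a reproducible example

Suppose we have a few paragraphs of the famous Lorem Ipsum text.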

library(strex)
# devtools::install_github("aakosm/lipsum")
library(lipsum)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

> my.string # (partial output)
# [1] "Lorem ipsum dolor ... id est laborum. "

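We want to break this text into segments at every 3rd occurrence of the word "in" (with a space included, to distinguish it from words that contain "in" as a part, such as "min").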
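As a small illustration of why the surrounding spaces matter, the sketch below counts fixed matches of " in" and " in " in a made-up sentence (the sentence and the helper `count_fixed` are hypothetical, not part of the original post). With only a leading space, the pattern still matches the start of words such as "incididunt".

```r
# Hypothetical sample sentence, not the lipsum output
x <- "tempor incididunt ut labore, dolor in reprehenderit in voluptate"

# Count non-overlapping fixed (non-regex) matches of `pattern` in one string
count_fixed <- function(pattern, s) {
  m <- gregexpr(pattern, s, fixed = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)
}

count_fixed(" in", x)   # 3: also matches the start of " incididunt"
count_fixed(" in ", x)  # 2: matches only the standalone word
```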

I have the following solution, which uses a while loop:

# We wish to break up the string at every
# 3rd occurrence of the word "in"

break.character = " in"
break.occurrence = 3
string.list = list()
i = 1

# initialize string to send into the loop
current.string = my.string

while(length(current.string) > 0){

  # Store the segment that occurs BEFORE the nth occurrence of the break string
  string.list[[i]] = str_before_nth(current.string, break.character, break.occurrence)

  # Update the next string to examine:
  # it is the part of the current string AFTER the nth occurrence of the break string
  current.string = str_after_nth(current.string, break.character, break.occurrence)

  i = i + 1
}

We are able to get the desired output in a list, with a warning (warning not shown):

> string.list # (partial output shown)
[[1]]
[1] "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit"

[[2]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"
...

[[6]]
[1] " voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor"

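Goal

Is it possible to improve this solution through vectorization (i.e. using apply(), lapply(), mapply(), etc.)? Also, my current solution cuts off the last occurrence of the substring in each block.

The current solution may not scale to extremely long strings (e.g. DNA sequences in which we are looking for blocks delimited by the nth occurrence of a nucleotide substring).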

r string text text-mining pattern-mining
2 Answers
1
vote

Try this:

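# Split on the delimiter; strsplit() drops the delimiter itself
text_split = strsplit(my.string, " in ")[[1]]

l = length(text_split)
n = floor(l/3)
# Start index of each group of 3 pieces: 1, 4, 7, ...
Seq = seq(1, by = 3, length.out = n)

L = sapply(Seq, function(x){
  # Rejoin 3 pieces and append the " in " that closed the group
  paste0(paste(text_split[x:(x+2)], collapse = " in "), " in ")
})
# Leftover pieces after the last complete group of 3
if (l > (n*3)){
  L = c(L, paste(text_split[(n*3+1):l], collapse = " in "))
}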

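The final condition covers the case where the number of pieces (the text between the " in " delimiters) is not divisible by 3. Also, the " in " pasted at the end inside sapply() is there because I don't know what you want to do with the "in" that separates your blocks.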
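The same split-and-regroup idea can be sketched for an arbitrary delimiter and n by assigning a group number to each piece. This is only an illustrative variant in base R; the helper name `split_every_nth` and the toy input are assumptions, not part of the answer above.

```r
# Hypothetical helper: split `s` after every n-th occurrence of `delim`
split_every_nth <- function(s, delim, n) {
  pieces <- strsplit(s, delim, fixed = TRUE)[[1]]
  # Group id for each piece: 1,1,1,2,2,2,... when n = 3
  grp <- ceiling(seq_along(pieces) / n)
  # Rejoin the pieces of each group with the delimiter
  chunks <- vapply(split(pieces, grp), paste, character(1), collapse = delim)
  # Re-attach the delimiter that closed every chunk except the last one
  k <- length(chunks)
  if (k > 1) chunks[-k] <- paste0(chunks[-k], delim)
  unname(chunks)
}

x <- "a in b in c in d in e"
split_every_nth(x, " in ", 3)
# [1] "a in b in c in " "d in e"
```

Pasting the chunks back together reproduces the input here. Note that strsplit() drops a trailing delimiter, so an input that ends in the delimiter would not round-trip exactly.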


1
vote

Let me know if this works. I will try to make it faster. It keeps the third "in" inside the block. If it works, I will also comment it.

library(lipsum)
library(stringi)

my.string = capture.output(lipsum(5))
my.string = paste(my.string, collapse = " ")

# End position of every " in " match
end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
# Logical recycling keeps every 3rd match end
start_of_strings <- c(1, end_of_in[c(FALSE, FALSE, TRUE)])
end_of_strings <- c(end_of_in[c(FALSE, FALSE, TRUE)] - 1, nchar(my.string))
end_of_strings <- end_of_strings[!duplicated(end_of_strings)]


stri_sub(my.string, start_of_strings, end_of_strings)
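To make the index arithmetic easier to follow, here is a toy walk-through of the same locate-then-substring idea. It is a sketch that swaps in base R (gregexpr() and substring()) for the stringi calls so that it runs without extra packages. The input string and the variable names are made up for illustration, and the input is assumed to contain at least one match.

```r
# Hypothetical toy input; base R stands in for stringi here
x <- "a in b in c in d in e in f in g"
n <- 3

m <- gregexpr(" in ", x, fixed = TRUE)[[1]]
# Position of the last character (the trailing space) of every match
ends <- as.integer(m) + attr(m, "match.length") - 1L   # 5 10 15 20 25 30
# Keep the end of every n-th match
cut_at <- ends[seq_along(ends) %% n == 0]               # 15 30

# Same convention as above: a chunk stops just before the trailing space,
# and the next chunk starts at that space
starts <- c(1L, cut_at)
stops  <- c(cut_at - 1L, nchar(x))
chunks <- substring(x, starts, stops)
chunks
# [1] "a in b in c in"  " d in e in f in" " g"
```

Because the chunk boundaries are computed once and the extraction is a single vectorized call, no intermediate copies of the remaining string are made, which is the property that matters for very long inputs.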

Edit: actually, use stri_sub from stringi. It will scale better than substring. See:

my.string <- paste(rep(my.string, 10000), collapse = " ")
nchar(my.string)
[1] 22349999

microbenchmark::microbenchmark(
  sol1 = {
    text_split=strsplit(my.string," in ")[[1]]

    l=length(text_split)
    n = floor(l/3)
    Seq = seq(1,by=2,length.out = n)

    L= list()
    L=sapply(Seq, function(x){
      paste0(paste(text_split[x:(x+2)],collapse=" in ")," in ")
    })
    if (l>(n*3)){
      L = c(L,paste(text_split[(n*3+1):l],collapse=" in "))
    }
  },
  sol2 = {
    end_of_in <- stri_locate_all(fixed = " in ", my.string)[[1]][,2]
    start_of_strings <- c(1, end_of_in[c(F, F, T)]) 
    end_of_strings <- c(end_of_in[c(F, F, T)] - 1, nchar(my.string))
    end_of_strings <- end_of_strings[!duplicated(end_of_strings)]
    stri_sub(my.string, start_of_strings, end_of_strings)
  },
  times = 10
)

Unit: milliseconds
 expr      min        lq      mean    median        uq       max neval
 sol1 914.1268 927.45958 941.36117 939.80361 950.18099 980.86941    10
 sol2  55.4163  56.40759  58.53444  56.86043  57.03707  71.02974    10