Fast partial string matching in R

Question (votes: 0, answers: 3)

Given a vector of strings, texts, and a vector of patterns, patterns, I want to find any matching patterns for each text.

For small datasets, this is easily done in R with grepl:

patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")

# for each x in patterns
lapply( patterns, function(x){
  # match all texts against pattern x
  res = grepl( x, texts, fixed=TRUE )
  print(res)
  # do something with the matches
  # ...
})

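This solution is correct, but it does not scale. Even on only moderately larger datasets (about 500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine. That is absurd, considering this is a crude partial string match without regular expressions (set with fixed=TRUE). Even making the lapply parallel does not solve the problem. Is there a way to rewrite this code efficiently?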

Thanks, Mulone

string r performance string-matching
3 Answers
17 votes

Use the stringi package; it is even faster than grepl. Check the benchmarks! I used the text from @Martin-Morgan's post.

require(stringi)
require(microbenchmark)

text = readLines("~/Desktop/pg100.txt")
pattern <-  strsplit("all the world's a stage and all the people players", " ")[[1]]

grepl_fun <- function(){
    lapply(pattern, grepl, text, fixed=TRUE)
}

stri_fixed_fun <- function(){
    # NA as a third positional argument is omitted: in current stringi that position is `negate`
    lapply(pattern, function(x) stri_detect_fixed(text, x))
}

#        microbenchmark(grepl_fun(), stri_fixed_fun())
#    Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
#     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100

# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()

for(i in seq_along(xx)){
    print(all(xx[[i]] == stri[[i]]))
}
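
To get from these per-pattern logical vectors back to what the question asks for (which patterns match each text), the results can be collected into a logical matrix. The following is a minimal sketch on the toy data from the question. The helper name detect_fixed and the fallback to grepl when stringi is not installed are assumptions made for illustration, not part of the original answer.

```r
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# Hypothetical helper: use stringi when it is installed, otherwise base R.
detect_fixed <- function(text, pattern) {
  if (requireNamespace("stringi", quietly = TRUE)) {
    stringi::stri_detect_fixed(text, pattern)
  } else {
    grepl(pattern, text, fixed = TRUE)
  }
}

# One column per pattern, one row per text (assumes at least two texts;
# with a single text vapply() returns a plain vector instead of a matrix).
hits <- vapply(patterns, function(p) detect_fixed(texts, p),
               logical(length(texts)))

# For each text, the patterns that occur in it.
matches <- lapply(seq_len(nrow(hits)), function(i) patterns[hits[i, ]])
print(matches)
```

Here the first text yields "some", "pattern" and "a", and the second yields "pattern" and "a". Keeping the full matrix in memory is only sensible while the number of texts times the number of patterns stays moderate.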

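9 votes

Have you accurately described your problem and the performance you are seeing? Here are the complete works of William Shakespeare and a query against them: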

text = readLines("~/Downloads/pg100.txt")
pattern <- 
    strsplit("all the world's a stage and all the people players", " ")[[1]]

This seems to be much more performant than you imply?

> length(text)
[1] 124787
> system.time(xx <- lapply(pattern, grepl, text, fixed=TRUE))
   user  system elapsed 
  0.444   0.001   0.444 
## avoid retaining memory; 500 x 500 case; no blank lines
> text = text[nzchar(text)]
> system.time({ for (p in rep(pattern, 50)) grepl(p, text[1:500], fixed=TRUE) })
   user  system elapsed 
  0.096   0.000   0.095 
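
For readers without the Shakespeare file, the 500 x 500 timing above can be approximated on synthetic data. This is only a sketch: random_strings and time_grepl are hypothetical helpers, the random 40-character texts and 3-character patterns are assumptions that are not representative of real prose, and absolute timings depend on the machine.

```r
set.seed(42)  # reproducible synthetic data

# Hypothetical generator: n random lowercase strings of `len` characters each.
random_strings <- function(n, len = 40) {
  vapply(seq_len(n),
         function(i) paste(sample(letters, len, replace = TRUE), collapse = ""),
         character(1))
}

# Time matching n_pat fixed patterns against n_txt texts with grepl().
time_grepl <- function(n_txt, n_pat) {
  txt <- random_strings(n_txt)
  pat <- random_strings(n_pat, len = 3)
  system.time(for (p in pat) grepl(p, txt, fixed = TRUE))[["elapsed"]]
}

t_500  <- time_grepl(500, 500)    # the 500 x 500 case from the question
t_1000 <- time_grepl(1000, 500)   # twice as many texts
print(c(t_500 = t_500, t_1000 = t_1000))
```

Comparing the two elapsed times gives a rough feel for how the cost grows with the number of texts. For anything more careful than that, repeated runs (for example with microbenchmark) would be needed.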

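We expect run time to scale linearly with the length (number of elements) of both pattern and text. It seems I misremembered my Shakespeare: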

> idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
> range(idx)
[1] 0 7
> sum(idx == 7)
[1] 8
> text[idx == 7]
[1] "    And all the men and women merely players;"                       
[2] "    cicatrices to show the people when he shall stand for his place."
[3] "    Scandal'd the suppliants for the people, call'd them"            
[4] "    all power from the people, and to pluck from them their tribunes"
[5] "    the fashion, and so berattle the common stages (so they call"    
[6] "    Which God shall guard; and put the world's whole strength"       
[7] "    Of all his people and freeze up their zeal,"                     
[8] "    the world's end after my name-call them all Pandars; let all"    
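
The Reduce("+", ...) step above adds the per-pattern logical vectors element-wise, so each entry of idx counts how many patterns matched that line. A minimal sketch of the same idea on the toy data from the question (the variable name per_pattern is introduced here for illustration):

```r
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# One logical vector per pattern, each as long as texts.
per_pattern <- lapply(patterns, grepl, texts, fixed = TRUE)

# Element-wise sum: TRUE counts as 1, so idx[i] is the number of
# patterns found in texts[i].
idx <- Reduce("+", per_pattern)
print(idx)                      # 3 2
print(texts[idx == max(idx)])   # the text matching the most patterns
```

Note that duplicated entries in the pattern vector (such as "all" and "the" in the Shakespeare query) are counted once per occurrence, so the count is per pattern element rather than per distinct word.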

0 votes

Try the stringfish package, which can be even faster than stringi:

library(stringfish)

text <- readLines(url("https://www.gutenberg.org/cache/epub/100/pg100.txt"))
pattern <- sf_split("all the world's a stage and all the people players", " ")[[1]]

lapply(pattern,\(x) sf_grepl(text, x, fixed = TRUE, nthreads = 4L))

Benchmark

library(stringfish)
library(stringi)
library(microbenchmark)

microbenchmark(
  stringi = lapply(pattern, function(x) stri_detect_fixed(text, x)),
  grepl = lapply(pattern, grepl, text, fixed=TRUE),
  starfish = lapply(pattern,\(x) sf_grepl(text, x, fixed = TRUE, nthreads = 4L)),
  unit = "relative",
  times = 30L
)

Unit: relative
     expr       min        lq     mean    median        uq       max neval cld
  stringi  1.205221  1.182551  1.26320  1.184777  1.193862  2.908447    30 a  
    grepl 22.199882 21.533002 21.11705 21.413562 21.039191 18.994737    30  b 
 starfish  1.000000  1.000000  1.00000  1.000000  1.000000  1.000000    30   c