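I have a large body of text in a character vector (approx. 700,000 strings). I am trying to replace specific words/phrases within this corpus. That is, I have a vector of approx. 40,000 phrases and a corresponding vector of replacements.
I am looking for an efficient way to solve this problem.
I can do it in a for loop, looping over each pattern + replacement, but it scales badly (3 days or so!).
I have also tried qdap::mgsub(), but it seems to scale badly as well.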
txt <- c("this is a random sentence containing bca sk",
"another senctence with bc a but also with zqx tt",
"this sentence contains non of the patterns",
"this sentence contains only bc a")
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c("@a-specfic-tag-@abc sk",
"@a-specfic-tag-@bc a",
"@a-specfic-tag-@zqx tt")
#either
txt2 <- qdap::mgsub(patterns, replacements, txt)
#or
for (i in seq_along(patterns)) {
  txt <- gsub(patterns[i], replacements[i], txt)
}
Both solutions scale badly for my data, with approx. 40,000 patterns/replacements and 700,000 txt strings.
I figure there must be a more efficient way of doing this?
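If you can tokenize the texts first, then vectorized replacement is much faster. It is also faster if a) you can use a multi-threaded solution and b) you use fixed instead of regular expression matching.
Here's how to do all of that in the quanteda package. The last line pastes the tokens back into a single "document" as a character vector, if that is what you want.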
library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
quanteda_options(threads = 4)
txt <- c(
"this is a random sentence containing bca sk",
"another sentence with bc a but also with zqx tt",
"this sentence contains none of the patterns",
"this sentence contains only bc a"
)
patterns <- c("abc sk", "bc a", "zqx tt")
replacements <- c(
"@a-specfic-tag-@abc sk",
"@a-specfic-tag-@bc a",
"@a-specfic-tag-@zqx tt"
)
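This tokenizes the texts and then uses fast replacement of the hashed types, with fixed pattern matching (you could use valuetype = "regex" for regular expression matching instead). By wrapping patterns in the phrase() function, you are telling tokens_replace() to look for token sequences rather than individual matches, which solves the multi-word issue.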
toks <- tokens(txt) %>%
tokens_replace(phrase(patterns), replacements, valuetype = "fixed")
toks
## tokens from 4 documents.
## text1 :
## [1] "this" "is" "a" "random" "sentence"
## [6] "containing" "bca" "sk"
##
## text2 :
## [1] "another" "sentence"
## [3] "with" "@a-specfic-tag-@bc a"
## [5] "but" "also"
## [7] "with" "@a-specfic-tag-@zqx tt"
##
## text3 :
## [1] "this" "sentence" "contains" "none" "of" "the"
## [7] "patterns"
##
## text4 :
## [1] "this" "sentence" "contains"
## [4] "only" "@a-specfic-tag-@bc a"
Finally, if you really want to put this back into character format, convert to a list of character types and then paste them together.
sapply(as.list(toks), paste, collapse = " ")
## text1
## "this is a random sentence containing bca sk"
## text2
## "another sentence with @a-specfic-tag-@bc a but also with @a-specfic-tag-@zqx tt"
## text3
## "this sentence contains none of the patterns"
## text4
## "this sentence contains only @a-specfic-tag-@bc a"
You will have to test this on your large corpus, but 700k strings does not sound like too large a task. Please try this and report how it did!
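The point about fixed matching also applies to the plain gsub() loop from the question. Here is a minimal base R sketch, assuming the patterns are literal strings with no regex syntax that needs interpreting; the function name replace_fixed is made up for illustration. It still makes one pass per pattern, so it is unlikely to be enough on its own for 40,000 patterns, and I have not timed it on data of that size.

```r
# Hypothetical helper: same loop as in the question, but with fixed = TRUE
# so each pattern is matched as literal text rather than compiled as a regex.
replace_fixed <- function(txt, patterns, replacements) {
  for (i in seq_along(patterns)) {
    txt <- gsub(patterns[i], replacements[i], txt, fixed = TRUE)
  }
  txt
}

res_fixed <- replace_fixed(
  c("x bc a y", "a.c abc"),
  c("a.c", "bc a"),
  c("<dot>", "<bc>")
)
print(res_fixed)
```

Note that with fixed = TRUE the "." in "a.c" is a literal dot, so it does not match "abc". Like the original loop, this matches substrings, so "bc a" would also match inside "abc a".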
Create a vector of all the words in each phrase
txt1 = strsplit(txt, " ")
words = unlist(txt1)
Use match() to find the index of the words to replace, and replace them
idx <- match(words, patterns)
words[!is.na(idx)] = replacements[idx[!is.na(idx)]]
Re-form the phrases and paste them together
phrases = relist(words, txt1)
updt = sapply(phrases, paste, collapse = " ")
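I guess this will not work if the patterns can contain more than one word, as they do in the question's example, because match() compares whole single words...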
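The steps above can be collected into one function, which also makes the multi-word limitation easy to see. This is only a sketch: the name replace_single_words is made up, and it assumes words are separated by single spaces.

```r
# Hypothetical helper wrapping the strsplit / match / relist steps.
replace_single_words <- function(txt, patterns, replacements) {
  tokens <- strsplit(txt, " ", fixed = TRUE)  # one character vector per string
  words <- unlist(tokens)                     # flatten to a single vector
  idx <- match(words, patterns)               # NA where a word is not a pattern
  hit <- !is.na(idx)
  words[hit] <- replacements[idx[hit]]
  # Restore the per-string structure, then paste each string back together.
  vapply(relist(words, tokens), paste, character(1), collapse = " ")
}

res_words <- replace_single_words(
  c("a zqx b", "bc a"),
  c("zqx", "bc a"),
  c("<zqx>", "<bc a>")
)
print(res_words)
```

The single-word pattern "zqx" is replaced, but the two-word pattern "bc a" is left alone, because after splitting on spaces no single token equals "bc a".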
Create a map between the old and new values
map <- setNames(replacements, patterns)
Create a pattern that contains all patterns in a single regular expression
pattern = paste0("(", paste0(patterns, collapse="|"), ")")
Find all matches, and extract them
ridx <- gregexpr(pattern, txt)
m <- regmatches(txt, ridx)
Un-list, map, and re-list the matches to their replacement values, and update the original vector
regmatches(txt, ridx) <- relist(map[unlist(m)], m)
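Put together, the single-regex approach might look like the sketch below. Two things here are my own additions rather than part of the steps above: the patterns are wrapped in \Q...\E with perl = TRUE so they are treated as literal text, and they are sorted longest first, because with perl = TRUE the first alternative that matches wins. The name replace_phrases is made up, and the sketch assumes no pattern contains a literal "\E". I have not tried it with 40,000 patterns; the regex engine may reject an alternation that large, in which case the patterns would have to be processed in chunks.

```r
# Hypothetical helper: one regex pass that replaces every pattern via a lookup map.
replace_phrases <- function(txt, patterns, replacements) {
  # Longest patterns first, so "bc a" is tried before "bc".
  ord <- order(nchar(patterns), decreasing = TRUE)
  patterns <- patterns[ord]
  replacements <- replacements[ord]
  map <- setNames(replacements, patterns)

  # \Q...\E quotes each pattern so regex metacharacters are matched literally.
  regex <- paste0("\\Q", patterns, "\\E", collapse = "|")

  hits <- gregexpr(regex, txt, perl = TRUE)
  found <- regmatches(txt, hits)
  # Look up each matched string in the map and write the replacements back.
  regmatches(txt, hits) <- relist(unname(map[unlist(found)]), found)
  txt
}

res_phrases <- replace_phrases(
  c("x bc a y", "a.c abc", "none"),
  c("bc", "bc a", "a.c"),
  c("<1>", "<2>", "<3>")
)
print(res_phrases)
```

As with gsub(), this matches substrings, so "bc" is also replaced inside "abc". If only whole words should match, the alternation could be wrapped in word boundaries, which is a design choice that depends on the data.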