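I want to extract the text between "de" and "en", and also the whole string when it contains neither "de" nor "en". I am not very good with regex, but after reading about lookaheads and lookbehinds I managed to get part of what I want. Now I need to make the lookarounds optional, but whatever I try, I cannot get it right.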
library(stringr)
(sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}', 'extract this one', '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one"))
#> [1] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [2] "extract this one"
#> [3] "{\"de\":\"extract this one\",\"en\":\"some text\"}"
#> [4] "p (340) extract this one"
str_extract_all(string = sstring, pattern = "(?<=.de\":\").*(?=.,\"en\":)")
#> [[1]]
#> [1] "extract this one"
#>
#> [[2]]
#> character(0)
#>
#> [[3]]
#> [1] "extract this one"
#>
#> [[4]]
#> character(0)
Desired output:
#> [1] "extract this one" "extract this one"
#> [3] "extract this one" "p (340) extract this one"
Created on 2020-05-08 by the reprex package (v0.3.0)
In base R:
gsub('.*de\":\"(.*)\",\"en.*',"\\1",sstring)
[1] "extract this one"
[2] "extract this one"
[3] "extract this one"
[4] "p (340) extract this one"
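The call above depends on two behaviours: the capture group holds the wanted text, and elements that the pattern does not match at all are returned unchanged. A minimal sketch of that second point, using a shortened two-element input (the variable names pattern and extracted are only illustrative and do not appear in the original answer):

```r
# Two of the inputs: one JSON-like string and one plain string.
sstring <- c('{"de":"extract this one","en":"some text"}',
             "p (340) extract this one")

# Same pattern as in the answer, stored in a variable for reuse.
pattern <- '.*de":"(.*)","en.*'

# Only the first element matches the pattern.
grepl(pattern, sstring)
#> [1]  TRUE FALSE

# A matching element is replaced by its capture group;
# a non-matching element is passed through untouched.
extracted <- sub(pattern, "\\1", sstring)
extracted
#> [1] "extract this one"         "p (340) extract this one"
```

Since the pattern covers the whole string, at most one replacement can happen per element, so sub and gsub give the same result here.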
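Here, .* matches any character, any number of times, and the parentheses (...) capture whatever is inside them so that it can be recalled later with "\\1".
In essence, the whole string that matches the pattern is replaced with only the text we want. Strings that do not match the pattern are returned unchanged.
I suggest using a pattern that matches either any string that does not contain the {"de":" substring, or a substring after {"de":" consisting of one or more characters other than ":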
(?<=\{"de":")[^"]+|^(?!.*\{"de":").+
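See the regex demo.
Details
(?<=\{"de":") - a positive lookbehind that matches a position immediately preceded by {"de":"
[^"]+ - then matches one or more characters other than "
| - or
^ - start of string
(?!.*\{"de":") - a negative lookahead that makes sure there is no {"de":" anywhere in the string
.+ - matches one or more characters other than line break characters, as many as possible.
See an R demo online: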
library(stringr)
sstring = c('{\"de\":\"extract this one\",\"en\":\"some text\"}', 'extract this one', '{\"de\":\"extract this one\",\"en\":\"some text\"}', "p (340) extract this one")
str_extract( sstring, '(?<=\\{"de":")[^"]+|^(?!.*\\{"de":").+')
# => [1] "extract this one" "extract this one"
# [3] "extract this one" "p (340) extract this one"
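To see what each half of the alternation contributes, the two branches can be run separately. The sketch below is an assumption-laden variant rather than the answer's own code: it uses base R with perl = TRUE instead of stringr, and first_match, after_de and no_de are hypothetical names introduced only for this illustration.

```r
sstring <- c('{"de":"extract this one","en":"some text"}',
             "extract this one",
             "p (340) extract this one")

# Branch 1 alone: text after {"de":" up to the next double quote.
after_de <- '(?<=\\{"de":")[^"]+'
# Branch 2 alone: the whole string, but only if {"de":" occurs nowhere in it.
no_de <- '^(?!.*\\{"de":").+'

# Hypothetical helper: first match per element, NA where nothing matches.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)   # -1 marks "no match"
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)        # regmatches returns matches only
  out
}

first_match(after_de, sstring)
#> [1] "extract this one" NA                 NA
first_match(no_de, sstring)
#> [1] NA                         "extract this one"
#> [3] "p (340) extract this one"

# Joined with |, the two branches cover every element.
first_match(paste(after_de, no_de, sep = "|"), sstring)
#> [1] "extract this one"         "extract this one"
#> [3] "p (340) extract this one"
```

Each branch returns NA exactly where the other one matches, which is why the combined pattern needs no optional lookarounds.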