Need to find a pattern and extract it [closed]


My data frame has these rows:

"110231 validation 108871 validation 85933"
"21102 validation 93442 21232 validation 73769 26402 validation 127221 26402"
"99763 99763 validation 99763 validation 99763"
"validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099"

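Each comma-separated string is a new row. I need to extract the first "validation" and the number that follows it from each row. How can I do that?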

The expected output for each row is:

"validation 108871"
"validation 93442"
"validation 99763"
"validation 199022"
r regex
1 Answer

I'll take a stab at this with two implementations.

First, I'll use a character vector. If your data is in a data frame, replace the vector with myframe$mycolumn.
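As a rough sketch of that substitution: the names myframe and mycolumn are placeholders, and the two-row data frame is made up for illustration. The only assumption is that the column holds the strings as character or factor values.

```r
# Hypothetical data frame: `myframe` and `mycolumn` are placeholder names.
myframe <- data.frame(
  mycolumn = c("110231 validation 108871 validation 85933",
               "99763 99763 validation 99763 validation 99763"),
  stringsAsFactors = FALSE  # keep strings as character (the default only since R 4.0)
)

# as.character() leaves character columns unchanged and converts factor columns.
v_from_frame <- as.character(myframe$mycolumn)
is.character(v_from_frame)
```

The rest of this answer works on a plain character vector, so anything shown for v should apply to a vector pulled out of a frame this way.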

v <- c("110231 validation 108871 validation 85933",
"21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
"99763 99763 validation 99763 validation 99763",
"validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099")

Extract the "validation <number>" matches:

re <- gregexpr("validation [0-9]+", v)
re
# [[1]]
# [1]  8 26
# attr(,"match.length")
# [1] 17 16
# attr(,"index.type")
# [1] "chars"
# attr(,"useBytes")
# [1] TRUE
# [[2]] ...
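
To make the gregexpr output less opaque, here is a small sketch of how the start positions and the match.length attribute map back to substrings. It does by hand what regmatches does; the variable names are mine.

```r
x <- "110231 validation 108871 validation 85933"
m <- gregexpr("validation [0-9]+", x)[[1]]

starts <- as.integer(m)             # start position of each match: 8 26
lens   <- attr(m, "match.length")   # length of each match: 17 16

# A match covers characters start .. start + length - 1.
manual <- substring(x, starts, starts + lens - 1L)
manual
# [1] "validation 108871" "validation 85933"
```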

We can extract the matched substrings with regmatches:

regmatches(v, re)
# [[1]]
# [1] "validation 108871" "validation 85933" 
# [[2]]
# [1] "validation 93442"  "validation 73769"  "validation 127221"
# [[3]]
# [1] "validation 99763" "validation 99763"
# [[4]]
# [1] "validation 199022" "validation 122099" "validation 12209" 
# [4] "validation 199022" "validation 199022" "validation 122099"

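Now we have a list in which each input string yields zero or more matching substrings. We can iterate over the list and take the first element of each entry.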

sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022"
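
If you prefer a stricter variant, vapply can stand in for sapply here. This is only a sketch of an alternative, not something the approach requires: vapply checks that every call returns exactly one character value and fails loudly otherwise.

```r
v <- c("110231 validation 108871 validation 85933",
       "99763 99763 validation 99763 validation 99763")
re <- gregexpr("validation [0-9]+", v)

# character(1) declares the expected type and length of each result;
# the trailing 1 is passed on to `[` as the index to extract.
first <- vapply(regmatches(v, re), `[`, character(1), 1)
first
# [1] "validation 108871" "validation 99763"
```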

This does not fail even when a string does not contain the pattern:

v <- c(v, "nothing here")
re <- gregexpr("validation [0-9]+", v)
sapply(regmatches(v, re), `[`, 1)
# [1] "validation 108871" "validation 93442"  "validation 99763" 
# [4] "validation 199022" NA                 

Here NA means there was no match, while the element still keeps its position in the string vector.
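
Since only the first match is wanted, regexpr (which reports just the first match per string) is another option. One caveat: regmatches on a regexpr result drops the strings that did not match, so the sketch below fills a vector of NA values to keep the positions aligned. The shortened input vector is for illustration only.

```r
v <- c("110231 validation 108871 validation 85933",
       "21102 validation 93442 21232 validation 73769 26402 validation 127221 26402",
       "nothing here")

# regexpr() returns the position of the first match, or -1 if there is none.
m <- regexpr("validation [0-9]+", v)
hit <- as.integer(m) != -1L

# regmatches(v, m) has one entry per matching string only (2 here, not 3),
# so assign it into the matching positions of an all-NA vector.
first <- rep(NA_character_, length(v))
first[hit] <- regmatches(v, m)
first
# [1] "validation 108871" "validation 93442"  NA
```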

Using gsub only

First, remove the leading digits and spaces up to, but not including, the first "validation":

gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE)
# [1] "validation 108871 validation 85933"                                                                        
# [2] "validation 93442 21232 validation 73769 26402 validation 127221 26402"                                     
# [3] "validation 99763 validation 99763"                                                                         
# [4] "validation 199022 validation 122099 validation 12209 validation 199022 validation 199022 validation 122099"
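
The lookahead (?=validation) is what keeps the word itself in the result: it requires "validation" to come next without consuming it. A small comparison on one of the strings, as a sketch (sub is enough here because the pattern is anchored with ^):

```r
x <- "21102 validation 93442 21232 validation 73769"

# With the lookahead, only the leading digits and spaces are removed.
with_lookahead <- sub("^[0-9 ]*(?=validation)", "", x, perl = TRUE)
with_lookahead
# [1] "validation 93442 21232 validation 73769"

# Without it, "validation" is part of the match and gets removed as well.
without_lookahead <- sub("^[0-9 ]*validation", "", x)
without_lookahead
# [1] " 93442 21232 validation 73769"
```

Lookaheads need perl=TRUE; as far as I know, R's default regex engine does not support them.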

Now remove everything after the first number, putting the captured last digit back with \\1:

gsub("([0-9])\\b.*", "\\1", gsub("^[0-9 ]*(?=validation)", "", v, perl=TRUE))
# [1] "validation 108871" "validation 93442"  "validation 99763"  "validation 199022"
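
The two gsub calls can probably be folded into a single sub with a capture group. This is a sketch of an alternative rather than part of the approach above. Note that sub returns a non-matching string unchanged, so the last line sets those entries to NA explicitly.

```r
v <- c("110231 validation 108871 validation 85933",
       "99763 99763 validation 99763 validation 99763",
       "nothing here")

# Lazy ^.*? skips as little as possible, so the group captures the FIRST
# "validation <number>"; the trailing .*$ swallows the rest of the string.
one_step <- sub("^.*?(validation [0-9]+).*$", "\\1", v, perl = TRUE)

# Strings without a match come back untouched; mark them as NA instead.
one_step[!grepl("validation [0-9]+", v)] <- NA_character_
one_step
# [1] "validation 108871" "validation 99763"  NA
```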