Manually recreating a regular expression function


As a learning exercise, I am trying to recreate a regular expression by hand in R.

For example, suppose I have this character vector:

var1 <- c("111 222 a1C 5b2", "B2G-6l3 atttr", "nothing here", "something P2b5p2 something")

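I want to check whether each element contains the following consecutive pattern: letter, digit, letter, an optional separator (nothing, a space, or a hyphen), digit, letter, digit.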

I tried to define the conditions for this problem by hand:

cond_1 <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", 
            "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", 
            "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", 
            "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")

cond_2 <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")

cond_3 <- c("", " ", "-")

Then I tried to write a loop that checks whether each element of var1 satisfies these conditions:

original_value <- c()
pattern_found <- c()
value <- c()

for (i in var1) {
    chars <- strsplit(i, "")[[1]]
    
    found <- FALSE
    
 
    for (j in 1:(length(chars) - 6)) {
        # Check if the pattern is found
        if (chars[j] %in% cond_1 && chars[j+1] %in% cond_2 && chars[j+2] %in% cond_1 &&
            chars[j+3] %in% cond_3 && chars[j+4] %in% cond_2 && chars[j+5] %in% cond_1 &&
            chars[j+6] %in% cond_2) {
            found <- TRUE
            break
        }
    }
    
  
    original_value <- c(original_value, i)
    pattern_found <- c(pattern_found, ifelse(found, "yes", "no"))
    value <- c(value, ifelse(found, paste(chars[j:(j+6)], collapse = ""), NA))
}


df <- data.frame(original_value, pattern_found, value)

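The code seems to work only partially (the fourth element contains P2b5p2, which has no separator, yet it is reported as "no"):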

              original_value pattern_found   value
1            111 222 a1C 5b2           yes a1C 5b2
2              B2G-6l3 atttr           yes B2G-6l3
3               nothing here            no    <NA>
4 something P2b5p2 something            no    <NA>

How can I fix this?

PS: This is the classic regular expression approach:

pattern <- "[a-zA-Z]\\d[a-zA-Z][- ,_]*\\d[a-zA-Z]\\d"

original_value <- c()
pattern_found <- c()
value <- c()

for (i in var1) {
  if (grepl(pattern, i)) {
    original_value <- c(original_value, i)
    pattern_found <- c(pattern_found, "yes")
    value <- c(value, regmatches(i, regexpr(pattern, i)))
  } else {
    original_value <- c(original_value, i)
    pattern_found <- c(pattern_found, "no")
    value <- c(value, NA)
  }
}

df <- data.frame(original_value, pattern_found, value)
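
For reference, the regex version above can also be written without a loop, because grepl() and regexpr() are vectorised over their input. The following is only a sketch of that idea using the same var1 and pattern as above; nothing else is assumed.

```r
var1 <- c("111 222 a1C 5b2", "B2G-6l3 atttr", "nothing here",
          "something P2b5p2 something")
pattern <- "[a-zA-Z]\\d[a-zA-Z][- ,_]*\\d[a-zA-Z]\\d"

# One logical per element: does the pattern occur anywhere in it?
found <- grepl(pattern, var1)

# regexpr() locates the first match in each element; regmatches() then
# returns the matched text for the matching elements only, so the results
# are written into a vector pre-filled with NA.
m <- regexpr(pattern, var1)
value <- rep(NA_character_, length(var1))
value[found] <- regmatches(var1, m)

df <- data.frame(original_value = var1,
                 pattern_found = ifelse(found, "yes", "no"),
                 value = value)
print(df)
```

This avoids growing three vectors with c() on every iteration and keeps the "yes"/"no" column and the extracted value in sync.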
1 Answer

You can build the regex pattern by hand from the source vectors, turning each vector into a character class, for example:

cond_1 <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", 
            "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", 
            "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", 
            "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")

cond_2 <- c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9")

cond_3 <- c(" ", "-")

r1 <- paste0("[", paste(cond_1, collapse=""), "]")
r2 <- paste0("[", paste(cond_2, collapse=""), "]")
r3 <- paste0("[", paste(cond_3, collapse=""), "]")

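# letter, digit, letter, optional separator, digit, letter, digit
regex <- paste0(r1, r2, r1, r3, "?", r2, r1, r2)
regex

[1] "[abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][0123456789][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][ -]?[0123456789][abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ][0123456789]"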
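
If the goal is to avoid regex entirely, the loop from the question can be repaired instead. The original check `chars[j+3] %in% cond_3` always requires some character at position j+3, and a single character produced by strsplit() is never "", so the no-separator case can never match. That is why P2b5p2 is missed. The sketch below treats the separator as optional by skipping one position only when a separator is actually present. The helper names match_at and find_pattern are hypothetical, and it assumes at most one separator (a space or a hyphen).

```r
letters_set <- c(letters, LETTERS)   # same contents as cond_1
digits_set  <- as.character(0:9)     # same contents as cond_2
sep_set     <- c(" ", "-")           # optional separators (assumed: at most one)

# Hypothetical helper: return the match that starts at position j, or NA.
match_at <- function(chars, j) {
  n <- length(chars)
  is_l <- function(k) k <= n && chars[k] %in% letters_set
  is_d <- function(k) k <= n && chars[k] %in% digits_set

  # First half: letter, digit, letter
  if (!(is_l(j) && is_d(j + 1) && is_l(j + 2))) return(NA_character_)

  # Optional separator: advance only if one is really there
  k <- j + 3
  if (k <= n && chars[k] %in% sep_set) k <- k + 1

  # Second half: digit, letter, digit
  if (is_d(k) && is_l(k + 1) && is_d(k + 2)) {
    return(paste(chars[j:(k + 2)], collapse = ""))
  }
  NA_character_
}

# Hypothetical helper: first match anywhere in the string x, or NA.
find_pattern <- function(x) {
  chars <- strsplit(x, "")[[1]]
  for (j in seq_along(chars)) {      # seq_along() is safe for short strings
    hit <- match_at(chars, j)
    if (!is.na(hit)) return(hit)
  }
  NA_character_
}

var1 <- c("111 222 a1C 5b2", "B2G-6l3 atttr", "nothing here",
          "something P2b5p2 something")
value <- vapply(var1, find_pattern, character(1), USE.NAMES = FALSE)

df <- data.frame(original_value = var1,
                 pattern_found = ifelse(is.na(value), "no", "yes"),
                 value = value)
print(df)
```

Bounds are checked inside the helper, so the loop no longer depends on `1:(length(chars) - 6)`, which counts downwards when a string has fewer than seven characters. The matched text is returned from the function rather than rebuilt from `j` after the loop has ended.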