Deterministic classification with regular expressions in R?


I have a list of regular expressions:

regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

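One detail worth checking in the list above: the first two patterns are anchored at both ends, while the Email pattern has no trailing `$`, so it only has to match a prefix of the string. The sketch below illustrates the difference; the sample text is made up for illustration and is not part of the question's data.

```r
# Email pattern copied from regex_list above
email_pattern <- "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}"

# Hypothetical input: a valid-looking address followed by extra text
sample_text <- "someone@example.org, and more"

# Without an end anchor, a matching prefix is enough
unanchored <- grepl(email_pattern, sample_text)

# Appending "$" forces the whole string to match
anchored <- grepl(paste0(email_pattern, "$"), sample_text)

print(c(unanchored = unanchored, anchored = anchored))
```

Whether the stricter form is wanted depends on the data; the pattern as posted accepts any string that starts with something address-like.
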
Then I have a character vector of strings to classify:

strings <- c(
  "John", "12345", "[email protected]", "InvalidString", 
  "Alice", "54321", "example.com", "Bob", "67890", "[email protected]",
  "Charlie", "98765", "[email protected]", "David", "13579", "invalid.email",
  "Eva", "24680", "[email protected]", "Frank", "11111", "frank@email"
)

Now I want to classify each string according to regex_list. While this can be done with two nested loops:

# Initialize the categories as NA so unmatched strings can be detected below
# (character() would give "", and is.na("") is never TRUE)
categories <- rep(NA_character_, length(strings))

# Categorize the strings based on the regular expressions
for (i in seq_along(strings)) {
  for (j in seq_along(regex_list)) {
    if (grepl(regex_list[[j]], strings[i])) {
      categories[i] <- names(regex_list)[j]
      break
    }
  }
  # If it doesn't fit into any category, set it to "No Category"
  if (is.na(categories[i])) {
    categories[i] <- "No Category"
  }
}

...I am looking for a more elegant way to achieve this. What might that be? :)

r loops classification text-classification
1 Answer

grepl() is vectorized over its x argument, so you only need a single for loop:

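categories <- rep("No Category", length(strings))
matched <- rep(FALSE, length(strings))

for (i in seq_along(regex_list)) {
  # Positions of the strings that no earlier pattern has claimed
  todo <- which(!matched)
  hit <- todo[grepl(regex_list[[i]], strings[todo])]
  categories[hit] <- names(regex_list)[i]
  # Record the matches so later patterns cannot overwrite them
  matched[hit] <- TRUE
}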

Each iteration only matches the strings that have not been classified yet.

cbind(strings, categories)
#       strings                 categories   
#  [1,] "John"                  "First Name" 
#  [2,] "12345"                 "Postal Code"
#  [3,] "[email protected]"    "Email"      
#  [4,] "InvalidString"         "First Name" 
#  [5,] "Alice"                 "First Name" 
#  [6,] "54321"                 "Postal Code"
#  [7,] "example.com"           "No Category"
#  [8,] "Bob"                   "First Name" 
#  [9,] "67890"                 "Postal Code"
# [10,] "[email protected]"   "Email"      
# [11,] "Charlie"               "First Name" 
# [12,] "98765"                 "Postal Code"
# [13,] "[email protected]" "Email"      
# [14,] "David"                 "First Name" 
# [15,] "13579"                 "Postal Code"
# [16,] "invalid.email"         "No Category"
# [17,] "Eva"                   "First Name" 
# [18,] "24680"                 "Postal Code"
# [19,] "[email protected]" "Email"      
# [20,] "Frank"                 "First Name" 
# [21,] "11111"                 "Postal Code"
# [22,] "frank@email"           "No Category"
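
Because grepl() is vectorized over x, the explicit loop can also be dropped altogether. The following is only a sketch of one possible variant, not part of the original answer: it builds a logical matrix with one column per pattern and then takes the first matching column in each row, so the order of regex_list decides ties. The sample strings are made up for illustration, and the names hits and first_hit are hypothetical.

```r
# Patterns copied from the question
regex_list <- list("First Name"  = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email"       = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

# Made-up sample inputs (assumption: not the question's data)
strings <- c("John", "12345", "example.com", "someone@example.org")

# Logical matrix: one row per string, one column per pattern
hits <- vapply(regex_list, grepl, logical(length(strings)), x = strings)

# Index of the first matching pattern in each row, NA when nothing matches
first_hit <- apply(hits, 1, function(row) which(row)[1])

# Translate the indices into category names and fill in the default
categories <- names(regex_list)[first_hit]
categories[is.na(categories)] <- "No Category"

print(categories)
```

Note that vapply() returns a plain vector rather than a matrix when strings has length one, so this sketch assumes at least two input strings. It also runs every pattern against every string, which the single-loop version avoids.
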