I have a list of regular expressions:
regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")
Then I have a vector of strings to classify:
strings <- c(
  "John", "12345", "[email protected]", "InvalidString",
  "Alice", "54321", "example.com", "Bob", "67890", "[email protected]",
  "Charlie", "98765", "[email protected]", "David", "13579", "invalid.email",
  "Eva", "24680", "[email protected]", "Frank", "11111", "frank@email"
)
Now I want to classify each string according to regex_list. While this can be done with two nested loops:
# Initialize the categories as missing, so unmatched strings can be detected
categories <- rep(NA_character_, length(strings))
# Categorize the strings based on the regular expressions
for (i in seq_along(strings)) {
  for (j in seq_along(regex_list)) {
    if (grepl(regex_list[[j]], strings[i])) {
      categories[i] <- names(regex_list)[j]
      break
    }
  }
  # If it doesn't fit into any category, set it to "No Category"
  if (is.na(categories[i])) {
    categories[i] <- "No Category"
  }
}
...I am looking for a more elegant way to achieve this. What might that be? :)
grepl() is vectorized over x, so you only need a single for loop:
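categories <- rep("No Category", length(strings))
matched <- rep(FALSE, length(strings))
for (i in seq_along(regex_list)) {
  # Indices of the strings that no earlier pattern has claimed
  todo <- which(!matched)
  hit <- todo[grepl(regex_list[[i]], strings[todo])]
  categories[hit] <- names(regex_list)[i]
  # Mark them as classified so later patterns skip them
  matched[hit] <- TRUE
}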
Each iteration only matches the strings that have not been classified yet.
cbind(strings, categories)
# strings categories
# [1,] "John" "First Name"
# [2,] "12345" "Postal Code"
# [3,] "[email protected]" "Email"
# [4,] "InvalidString" "First Name"
# [5,] "Alice" "First Name"
# [6,] "54321" "Postal Code"
# [7,] "example.com" "No Category"
# [8,] "Bob" "First Name"
# [9,] "67890" "Postal Code"
# [10,] "[email protected]" "Email"
# [11,] "Charlie" "First Name"
# [12,] "98765" "Postal Code"
# [13,] "[email protected]" "Email"
# [14,] "David" "First Name"
# [15,] "13579" "Postal Code"
# [16,] "invalid.email" "No Category"
# [17,] "Eva" "First Name"
# [18,] "24680" "Postal Code"
# [19,] "[email protected]" "Email"
# [20,] "Frank" "First Name"
# [21,] "11111" "Postal Code"
# [22,] "frank@email" "No Category"
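If you would rather avoid the explicit loop altogether, one possible sketch is to build a logical matrix of matches (one column per pattern) and pick the first matching column in each row. The helper name classify_first_match, its default argument, and the sample address alice@example.org are made up for illustration and are not part of the question; the patterns are the ones from regex_list above.

```r
# Patterns copied from the question
regex_list <- list("First Name" = "^[A-Za-z]+$",
                   "Postal Code" = "^[0-9]{5}$",
                   "Email" = "^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}")

# Hypothetical helper: label each string with the first pattern that matches it
classify_first_match <- function(x, patterns, default = "No Category") {
  # One column per pattern, one row per string; TRUE where the pattern matches
  hits <- vapply(patterns, grepl, logical(length(x)), x = x)
  # vapply() returns a plain vector when x has length 1, so force a matrix shape
  hits <- matrix(hits, nrow = length(x), ncol = length(patterns))
  # Index of the first matching pattern in each row, NA if none matches
  first <- apply(hits, 1, function(row) which(row)[1])
  out <- names(patterns)[first]
  out[is.na(out)] <- default
  out
}

# Made-up sample input for illustration
sample_strings <- c("John", "12345", "alice@example.org", "example.com", "frank@email")
result <- classify_first_match(sample_strings, regex_list)
print(cbind(sample_strings, result))
```

Like the loop version, this gives priority to patterns that come earlier in the list. The trade-off is that every pattern is tested against every string, which is usually fine for a handful of patterns but does more work than skipping already-classified strings.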