Summary: what is the most efficient way to compute multiple regular-expression matches and rank the results by number of occurrences? Or should a semantic approach be used instead of regex?
Sample data for illustration:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)
And a sample string query containing multiple terms:
sample_query <- c("after tax income")
Using grepl, it is easy to check whether the query string matches:
sample_string[grepl(sample_query, sample_string)]
But clearly that won't work here: there is no exact match, because the actual term is after-tax income. One alternative is to split the search query into its parts and check those:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")), collapse = "|"), sample_string)]
This works, but returns too many results, because it matches any instance of any of the terms:
[1] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[2] "Number of employment income recipients aged 15 years and over in private households"
[3] "Number of market income recipients aged 15 years and over in private households"
[4] "Employment income (%)"
[5] "Without employment income"
[6] "With after-tax income"
[7] "Spending 30% or more of income on shelter costs"
Question: how can I efficiently return the closest matches, ranked by the number of individual term matches?
Applying some of the answers here, and adding ordering by match count, produces a monstrosity:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)][order(-lengths(regmatches(
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)],
gregexpr(paste(unlist(
strsplit(sample_query, " +")
),
collapse = "|"),
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)])
)))]
This returns what I want: all strings with at least one match, ordered by the number of matches.
[1] "With after-tax income"
[2] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[3] "Number of employment income recipients aged 15 years and over in private households"
[4] "Number of market income recipients aged 15 years and over in private households"
[5] "Employment income (%)"
[6] "Without employment income"
[7] "Spending 30% or more of income on shelter costs"
Cleaning up the monster above a little:
to_match <- paste(unlist(strsplit(sample_query, " +")),collapse = "|")
results <- sample_string[grepl(to_match,sample_string)]
results[order(-lengths(regmatches(results,gregexpr(to_match,results))))]
I can live with this, but is there a way to make it more concise? More importantly, I wonder whether this is even the best way to approach the problem.
I am aware of stringr::str_count
and stringi::stri_count_regex.
This is for a package, and I am trying to avoid adding further dependencies, but I could switch to these if they are significantly more efficient.
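If the dependency were acceptable, the same "rank by match count" logic can be written in two vectorized calls with stringi. A sketch on a reduced sample (the reduced `sample_string` here is just for illustration):

```r
library(stringi)

sample_query  <- "after tax income"
# Reduced sample for illustration
sample_string <- c("With after-tax income",
                   "Without employment income",
                   "1 household maintainer")

# Alternation pattern over the individual query terms
to_match <- paste(unlist(strsplit(sample_query, " +")), collapse = "|")

# stri_count_regex() counts matches per string in one vectorized call
counts <- stri_count_regex(sample_string, to_match)

# Keep strings with at least one match, sorted by descending match count
hits   <- counts > 0
result <- sample_string[hits][order(-counts[hits])]
result
```

This replaces the `grepl` + `lengths(regmatches(...))` pair with a single count vector, which also avoids scanning each string twice.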
Alternatively, would string distance be a better option? Would it also scale to checking thousands of long strings?
The goal is to help users find relevant information, so perhaps something semantically more meaningful would work better.
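For the string-distance route, base R's `adist()` (generalized Levenshtein distance) needs no extra dependency; with `partial = TRUE` it scores the query against the best-matching substring of each candidate. A minimal sketch on a reduced sample:

```r
sample_query  <- "after tax income"
# Reduced sample for illustration
sample_string <- c("With after-tax income",
                   "Without employment income",
                   "Employment income (%)")

# partial = TRUE matches the query against substrings of each candidate,
# so "after tax income" is close to the substring "after-tax income"
d <- adist(sample_query, sample_string, partial = TRUE, ignore.case = TRUE)

# Pick the candidate with the smallest edit distance
best <- sample_string[which.min(d)]
best
```

Unlike the pure match-count approach, this tolerates small spelling differences (hyphens, typos) without any query splitting.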
I'm sure this can be improved, but here is one way you could do it using Levenshtein distance:
# Desired query scalar: actual_query => character vector
actual_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(actual_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)
# Split each word preserving any non-character values:
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x){
  sapply(seq_along(x), function(i){
    paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
  })
})
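For intuition, here is what the windowing step above produces for a single short sentence, with `word_count` assumed to be 3 as it is for this query:

```r
word_count <- 3
x <- c("with", "after-tax", "income")

# Each position i yields the window of up to word_count words starting
# at i, truncated at the end of the sentence
windows <- sapply(seq_along(x), function(i) {
  paste(x[i:min(length(x), (i + word_count) - 1)], collapse = " ")
})
windows
```

So every sentence contributes its full n-grams plus the shorter tail windows, which is why the query can later be matched against "after-tax income" directly.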
# Rank n-grams based on the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Combine the query with each of its elements: revised_query => character vector
revised_query <- c(actual_query, unlist(strsplit(actual_query, "\\s+")))
# Use Levenshtein distance to determine similarity of revised_query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(revised_query),
function(i){
adist(revised_query[i], ordered_ngram_count)
}
)), gsub("\\s+", "_", revised_query))
# Example of applying function returning string element in sample string
# with the minimum edit distance: sample_string element => stdout (console)
# Note: seq_along(ncol(lev_dist_df)) always evaluates to 1 (seq_len() would
# be needed to iterate over the columns), so only the first column -- the
# full query -- is actually used; grep() needs a single pattern anyway.
# Written explicitly:
grep(grep(ordered_ngram_count[which.min(lev_dist_df[, 1])], sample_string,
          value = TRUE),
     sample_string,
     value = TRUE)
A cleaner version:
# Desired query scalar: sample_query => character vector
sample_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(sample_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)
# Split each word preserving any non-character values:
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x){
  sapply(seq_along(x), function(i){
    paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
  })
})
# Rank n-grams based on the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Use Levenshtein distance to determine similarity of sample_query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(sample_query),
function(i){
adist(sample_query[i], ordered_ngram_count)
})), gsub("\\s+", "_", sample_query))
# Example of applying function returning string element in sample string
# with the minimum edit distance: sample_string element => stdout (console)
grep(grep(ordered_ngram_count[which.min(lev_dist_df[,1])], sample_string,
value = TRUE), sample_string, value = TRUE)
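The cleaner version above can be bundled into a small helper. This is my own sketch, not the original code: `closest_match` is a hypothetical name, and the nested `grep(grep(...))` is simplified to a single case-insensitive `grep`:

```r
closest_match <- function(query, strings) {
  query_words <- unlist(strsplit(tolower(query), "[^a-z]+"))
  n <- length(query_words)

  # n-gram windows of up to n words for every string
  words <- strsplit(tolower(strings), "\\s+")
  n_grams <- unlist(lapply(words, function(x) {
    sapply(seq_along(x), function(i) {
      paste(x[i:min(length(x), i + n - 1)], collapse = " ")
    })
  }))

  # Most frequent n-grams first, then pick the one closest to the query
  grams <- trimws(names(sort(table(n_grams), decreasing = TRUE)))
  best  <- grams[which.min(adist(query, grams))]

  # Return the original strings containing the best-matching n-gram
  grep(best, strings, value = TRUE, ignore.case = TRUE)
}

res <- closest_match("after tax income",
                     c("With after-tax income", "Without employment income"))
res
```

Wrapping it up this way makes the n-gram length adapt automatically to the query, so the same helper works for one-word and multi-word queries.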
Data:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)