Summary: what is the most efficient way to compute multiple regular-expression matches and rank the results by number of occurrences? Or should a semantic approach be used instead of regex?
Sample data for illustration:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)
And a sample string query containing multiple terms:
sample_query <- c("after tax income")
Using grepl, it is easy to check whether the query string matches:
sample_string[grepl(sample_query, sample_string)]
But clearly that won't work here: there is no exact match, because the actual term is after-tax income. One alternative is to split the search query into its parts and check those:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")), collapse = "|"), sample_string)]
This works, but returns too many results, because it matches any instance of any of the terms:
[1] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[2] "Number of employment income recipients aged 15 years and over in private households"
[3] "Number of market income recipients aged 15 years and over in private households"
[4] "Employment income (%)"
[5] "Without employment income"
[6] "With after-tax income"
[7] "Spending 30% or more of income on shelter costs"
Question: how can I efficiently return the closest matches, ranked by the number of individual term matches?
Applying some of the answers here, and adding ordering by match count, produces a monstrosity:
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)][order(-lengths(regmatches(
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)],
gregexpr(paste(unlist(
strsplit(sample_query, " +")
),
collapse = "|"),
sample_string[grepl(paste(unlist(strsplit(sample_query, " +")),
collapse = "|"),
sample_string)])
)))]
This returns what I want: all strings with at least one match, ordered by the number of matches.
[1] "With after-tax income"
[2] "Number of market income recipients aged 15 years and over in private households - 25% sample data"
[3] "Number of employment income recipients aged 15 years and over in private households"
[4] "Number of market income recipients aged 15 years and over in private households"
[5] "Employment income (%)"
[6] "Without employment income"
[7] "Spending 30% or more of income on shelter costs"
Cleaning up the monster above a little:
to_match <- paste(unlist(strsplit(sample_query, " +")),collapse = "|")
results <- sample_string[grepl(to_match,sample_string)]
results[order(-lengths(regmatches(results,gregexpr(to_match,results))))]
I can live with this, but is there a way to make it more concise? More importantly, I wonder whether this is even the best way to approach the problem.
I am aware of stringr::str_count
and stringi::stri_count_regex.
This is for a package, and I am trying to avoid adding further dependencies, but I could switch to these if they are significantly more efficient.
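If the dependency were acceptable, the same "rank by match count" logic can be written in two vectorized calls with stringi. A sketch on a reduced sample (the reduced `sample_string` here is just for illustration):

```r
library(stringi)

sample_query  <- "after tax income"
# Reduced sample for illustration
sample_string <- c("With after-tax income",
                   "Without employment income",
                   "1 household maintainer")

# Alternation pattern over the individual query terms
to_match <- paste(unlist(strsplit(sample_query, " +")), collapse = "|")

# stri_count_regex() counts matches per string in one vectorized call
counts <- stri_count_regex(sample_string, to_match)

# Keep strings with at least one match, sorted by descending match count
hits   <- counts > 0
result <- sample_string[hits][order(-counts[hits])]
result
```

This replaces the `grepl` + `lengths(regmatches(...))` pair with a single count vector, which also avoids scanning each string twice.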
Alternatively, would string distance be a better option? Would it also scale to checking thousands of long strings?
The goal is to help users find relevant information, so perhaps something semantically more meaningful would work better.
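For the string-distance route, base R's `adist()` (generalized Levenshtein distance) needs no extra dependency; with `partial = TRUE` it scores the query against the best-matching substring of each candidate. A minimal sketch on a reduced sample:

```r
sample_query  <- "after tax income"
# Reduced sample for illustration
sample_string <- c("With after-tax income",
                   "Without employment income",
                   "Employment income (%)")

# partial = TRUE matches the query against substrings of each candidate,
# so "after tax income" is close to the substring "after-tax income"
d <- adist(sample_query, sample_string, partial = TRUE, ignore.case = TRUE)

# Pick the candidate with the smallest edit distance
best <- sample_string[which.min(d)]
best
```

Unlike the pure match-count approach, this tolerates small spelling differences (hyphens, typos) without any query splitting.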
I'm sure this can be improved, but here is one way you could do it using Levenshtein distance:
# Desired query scalar: actual_query => character vector
actual_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(actual_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)
# Split each word preserving any non-character values:
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x){
  sapply(seq_along(x), function(i){
    paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
  })
})
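For intuition, here is what the windowing step above produces for a single short sentence, with `word_count` assumed to be 3 as it is for this query:

```r
word_count <- 3
x <- c("with", "after-tax", "income")

# Each position i yields the window of up to word_count words starting
# at i, truncated at the end of the sentence
windows <- sapply(seq_along(x), function(i) {
  paste(x[i:min(length(x), (i + word_count) - 1)], collapse = " ")
})
windows
```

So every sentence contributes its full n-grams plus the shorter tail windows, which is why the query can later be matched against "after-tax income" directly.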
# Rank n-grams based on the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Combine the query with each of its elements: revised_query => character vector
revised_query <- c(actual_query, unlist(strsplit(actual_query, "\\s+")))
# Use Levenshtein distance to determine similarity of revised_query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(revised_query),
function(i){
adist(revised_query[i], ordered_ngram_count)
}
)), gsub("\\s+", "_", revised_query))
# Example of applying function returning string element in sample string
# with the minimum edit distance: sample_string element => stdout (console)
# Note: seq_along(ncol(lev_dist_df)) always evaluates to 1 (seq_len() would
# be needed to iterate over the columns), so only the first column -- the
# full query -- is actually used; grep() needs a single pattern anyway.
# Written explicitly:
grep(grep(ordered_ngram_count[which.min(lev_dist_df[, 1])], sample_string,
          value = TRUE),
     sample_string,
     value = TRUE)
A cleaner version:
# Desired query scalar: sample_query => character vector
sample_query <- "after tax income"
# Separate words in query: query_words => character vector:
query_words <- unlist(strsplit(tolower(sample_query), "[^a-z]+"))
# Calculate n (scalar) for n-grams: word_count => integer vector
word_count <- length(query_words)
# Split each word preserving any non-character values:
# sentence_word_split => character vector
sentence_word_split <- strsplit(tolower(sample_string), "\\s+")
# Split original sentences into n-grams (relative to query length):
# n_grams => list
n_grams <- lapply(sentence_word_split, function(x){
  sapply(seq_along(x), function(i){
    paste(x[i:min(length(x), ((i+word_count)-1))], sep = " ", collapse = " ")
  })
})
# Rank n-grams based on the frequency of their occurrence in sample_string:
# ordered_ngram_count => character vector
ordered_ngram_count <- trimws(names(sort(table(unlist(n_grams)), decreasing = TRUE)), "both")
# Use Levenshtein distance to determine similarity of sample_query
# to the expressions in ordered_ngram_count: lev_dist_df => data.frame
lev_dist_df <- setNames(data.frame(sapply(seq_along(sample_query),
function(i){
adist(sample_query[i], ordered_ngram_count)
})), gsub("\\s+", "_", sample_query))
# Example of applying function returning string element in sample string
# with the minimum edit distance: sample_string element => stdout (console)
grep(grep(ordered_ngram_count[which.min(lev_dist_df[,1])], sample_string,
value = TRUE), sample_string, value = TRUE)
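The cleaner version above can be bundled into a small helper. This is my own sketch, not the original code: `closest_match` is a hypothetical name, and the nested `grep(grep(...))` is simplified to a single case-insensitive `grep`:

```r
closest_match <- function(query, strings) {
  query_words <- unlist(strsplit(tolower(query), "[^a-z]+"))
  n <- length(query_words)

  # n-gram windows of up to n words for every string
  words <- strsplit(tolower(strings), "\\s+")
  n_grams <- unlist(lapply(words, function(x) {
    sapply(seq_along(x), function(i) {
      paste(x[i:min(length(x), i + n - 1)], collapse = " ")
    })
  }))

  # Most frequent n-grams first, then pick the one closest to the query
  grams <- trimws(names(sort(table(n_grams), decreasing = TRUE)))
  best  <- grams[which.min(adist(query, grams))]

  # Return the original strings containing the best-matching n-gram
  grep(best, strings, value = TRUE, ignore.case = TRUE)
}

res <- closest_match("after tax income",
                     c("With after-tax income", "Without employment income"))
res
```

Wrapping it up this way makes the n-gram length adapt automatically to the query, so the same helper works for one-word and multi-word queries.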
Data:
sample_string <- c("Total - Main mode of commuting for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Total - Language used most often at work for the population in private households aged 15 years and over who worked since January 1, 2015 - 25% sample data",
"Number of market income recipients aged 15 years and over in private households - 25% sample data",
"Number of employment income recipients aged 15 years and over in private households",
"Total - Major field of study - Classification of Instructional Programs (CIP) 2016 for the population aged 15 years and over in private households - 25% sample data",
"Total - Selected places of birth for the recent immigrant population in private households - 25% sample data",
"Total - Commuting duration for the employed labour force aged 15 years and over in private households with a usual place of work or no fixed workplace address - 25% sample data",
"Number of market income recipients aged 15 years and over in private households",
"Employment income (%)", "Total - Aboriginal ancestry for the population in private households - 25% sample data",
"Without employment income", "With after-tax income", "1 household maintainer",
"Spending 30% or more of income on shelter costs", "Total - Highest certificate, diploma or degree for the population aged 25 to 64 years in private households - 25% sample data"
)