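I am trying to do approximate matching between a reference string and a target string.
I have tried adist and stringdist in R with the various distance methods available.
The algorithms do a good job of matching strings that consist only of letters, but they fail to match strings that contain digits and special characters (%, etc.).
How can I handle this case?
My code is below.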
library(stringdist)

# Pairwise LCS distances between the lower-cased reference and target names
dist.name <- outer(tolower(WW_name), tolower(Px_name),
                   stringdist::stringdist, method = "lcs")

# We now take the pairs with the minimum distance
min.name <- apply(dist.name, 1, min)

match.s1.s2 <- NULL
chk <- function(x) {
  s2.i <- match(min.name[x], dist.name[x, ])
  s1.i <- x
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i,
                                  WWname = WW_lookup[s1.i],
                                  Pxname = Px_lookup[s2.i],
                                  adist = min.name[x]),
                       match.s1.s2)
  return(match.s1.s2)
}
outDf <- lapply(1:nrow(dist.name), FUN = chk)
outDf <- do.call(rbind.data.frame, outDf)
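Examples of incorrect matches -
Pxname is the match returned by the algorithm | MappedPxName is the manual mapping
Any suggestions would be greatly appreciated.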
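Here is my attempt. Since you cannot share your data, there is a limit to how much I can help, but I hope the following helps you solve your problem. I think the challenge here is converting digits into written numbers, which we can handle with the english package. I created two vectors (WWname and PXname) and wrote my own function, myfun. It handles some string manipulation, such as converting % to "percent" and converting digits to written numbers. Using this function, I created two data frames (ww and px).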
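To see why spelling the numbers out can help, here is a minimal sketch. It uses base R's adist (Levenshtein distance) rather than the lcs method used in the rest of this answer, and the example strings are made up for illustration. Two percentages that differ by a single digit are only one edit apart, while their written forms are further apart, so the numeric difference weighs more in the overall distance.

```r
# Base R only: utils::adist computes the generalized Levenshtein distance.
# The strings below are illustrative, not taken from the asker's data.
digits <- c("85%", "80%")
words  <- c("eighty-five percent", "eighty percent")

# adist() returns a 1 x 1 matrix here; as.vector() turns it into a plain number.
d_digits <- as.vector(adist(digits[1], digits[2]))
d_words  <- as.vector(adist(words[1], words[2]))

print(c(digits = d_digits, words = d_words))
```

The written forms differ by the whole "-five" token instead of one character, which is the effect myfun relies on.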
library(tidyverse)
library(english)
library(stringi)
library(stringdist)
WWname <- c("Excellence Dark 85% Cocoa 100g", "Excellence Dark 78% COCOA 100G",
"ZDEL Excellence Dark 85% Cocoa 100g", "Excellence Dark 50% Cocoa 100g")
PXname <- c("Excellence Dark 85% Cocoa 100g", "Excellence Dark 80% Cocoa 200g",
"ZDEL Excellence Dark 80% Cocoa 100g", "ZDEL Excellence Dark 85% Cocoa 100g",
"ZDEL Excellence Dark 78% Cocoa 100g", "Excellence Dark 78% Cocoa 100g",
"Excellence Dark 50% Cocoa 100g")
# Normalise product names: spell out "%" and the gram unit, lower-case,
# then replace every token that contains digits with its English words.
myfun <- function(myvec) {
  sub(x = myvec, pattern = "%", replacement = " percent") %>%
    # [gG] so that the lookbehind (a digit before the unit) applies to both cases
    sub(pattern = "(?<=[0-9])[gG]", replacement = " grams", perl = TRUE) %>%
    tolower %>%
    stri_split_regex(pattern = "\\s") %>%
    enframe %>%
    unnest(value) %>%
    mutate(word = if_else(grepl(x = value, pattern = "[0-9]+"),
                          as.character(english(as.numeric(value))),
                          value)) %>%
    group_by(name) %>%
    summarize(string = paste0(word, collapse = " ")) -> out
  return(out)
}

# Convert numbers to words and create new strings.
ww <- myfun(myvec = WWname)
px <- myfun(myvec = PXname)
# Calculate distance.
mymat <- stringdistmatrix(a = ww$string, b = px$string, method = "lcs")
rownames(mymat) <- WWname
colnames(mymat) <- PXname

# Flag every WWname that has at least one exact match (distance 0).
mymat %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column(var = "WWname") %>%
  pivot_longer(cols = -WWname, names_to = "PXname", values_to = "ranking") %>%
  group_by(WWname) %>%
  mutate(check = any(ranking == 0)) -> out
Now we check the matches.
filter(out, check == TRUE) %>%
  slice(which.min(ranking))
  WWname                              PXname                              ranking check
  <chr>                               <chr>                                 <dbl> <lgl>
1 Excellence Dark 50% Cocoa 100g      Excellence Dark 50% Cocoa 100g            0 TRUE
2 Excellence Dark 78% COCOA 100G      Excellence Dark 78% Cocoa 100g            0 TRUE
3 Excellence Dark 85% Cocoa 100g      Excellence Dark 85% Cocoa 100g            0 TRUE
4 ZDEL Excellence Dark 85% Cocoa 100g ZDEL Excellence Dark 85% Cocoa 100g       0 TRUE
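If we want to inspect the non-matches, we can do the following. For each WWname without an exact match, this returns the closest candidate. In this case, there are none.
filter(out, check == FALSE) %>%
  slice(which.min(ranking))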
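When a closest non-match does show up, it may help to check whether the numbers in the two names actually agree before accepting it. The following is only a sketch of that idea in base R and is not part of the approach above: extract_numbers and numbers_agree are hypothetical helpers I made up for illustration, and adist (Levenshtein) stands in for the lcs distance.

```r
# Hypothetical helper: pull out every run of digits from each string.
extract_numbers <- function(x) regmatches(x, gregexpr("[0-9]+", x))

# Hypothetical helper: TRUE when both strings contain the same numbers
# in the same order (compared element by element).
numbers_agree <- function(a, b) {
  unname(mapply(identical, extract_numbers(a), extract_numbers(b)))
}

ref  <- "Excellence Dark 78% COCOA 100G"
cand <- c("Excellence Dark 78% Cocoa 100g", "Excellence Dark 80% Cocoa 200g")

# Case-insensitive edit distance from the reference to each candidate.
d <- as.vector(adist(tolower(ref), tolower(cand)))

# Flag candidates whose numeric tokens differ from the reference.
agree <- numbers_agree(rep(ref, length(cand)), cand)

result <- data.frame(candidate = cand, distance = d, numbers_agree = agree)
print(result)
```

A candidate with a small distance but numbers_agree equal to FALSE (a different percentage or weight) is probably a different product, so it may be worth reviewing by hand rather than accepting automatically.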