Fuzzy matching strings that contain numbers

Question (votes: 1, answers: 1)

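I am trying to do approximate matching between reference strings and target strings.

I have tried adist and stringdist in R with the various distance methods they offer.

The algorithms do a good job on strings that contain only letters, but they fail to match strings that contain numbers and special characters (% and so on).

How can I handle this case?

Here is my code.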

library(stringdist)

dist.name <- outer(tolower(WW_name),tolower(Px_name),
                   stringdist::stringdist, method = "lcs")

# We now take the pairs with the minimum distance
min.name<-apply(dist.name, 1, min)

match.s1.s2<-NULL

chk <- function(x){
  s2.i<-match(min.name[x],dist.name[x,])
  s1.i<-x
  match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,WWname=WW_lookup[s1.i],
                                Pxname=Px_lookup[s2.i],
                                adist=min.name[x]),match.s1.s2)
  return(match.s1.s2)
}

outDf <- lapply(1:nrow(dist.name),FUN = chk)

outDf <- do.call(rbind.data.frame, outDf)

Examples of incorrect matches:

[image not available]

Pxname is the match produced by the algorithm; MappedPxName is the manual mapping.

Any suggestions would be greatly appreciated.

r fuzzy-logic fuzzy-comparison stringdist
1 answer (0 votes)

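Here is my attempt. Since you cannot share your data, what I can offer is limited, but I hope the following helps with your problem. I think the challenge here is converting numerals into written numbers, which the english package can handle. I created two vectors (WWname and PXname) and my own function, myfun. It takes care of some string manipulation, such as converting % to "percent" and numerals to written numbers. I applied this function to create two data frames (ww and px).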

library(tidyverse)
library(english)
library(stringi)
library(stringdist)

WWname <- c("Excellence Dark 85% Cocoa 100g", "Excellence Dark 78% COCOA 100G",
            "ZDEL Excellence Dark 85% Cocoa 100g", "Excellence Dark 50% Cocoa 100g")

PXname <- c("Excellence Dark 85% Cocoa 100g", "Excellence Dark 80% Cocoa 200g",
            "ZDEL Excellence Dark 80% Cocoa 100g", "ZDEL Excellence Dark 85% Cocoa 100g",
            "ZDEL Excellence Dark 78% Cocoa 100g", "Excellence Dark 78% Cocoa 100g",
            "Excellence Dark 50% Cocoa 100g")

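myfun <- function(myvec) {

  # Spell out "%" and a gram unit that directly follows a digit ("100g", "100G").
  sub(x = myvec, pattern = "%", replacement = " percent") %>%
    sub(pattern = "(?<=[0-9])[gG]", replacement = " grams", perl = TRUE) %>%
    tolower() %>%
    # Split into whitespace-separated tokens, one row per token.
    stri_split_regex(pattern = "\\s") %>%
    enframe() %>%
    unnest(value) %>%
    # Replace tokens that contain digits with their English spelling.
    mutate(word = if_else(grepl(x = value, pattern = "[0-9]+"),
                          as.character(english(as.numeric(value))),
                          value)) %>%
    # Glue the tokens back into one string per input element.
    group_by(name) %>%
    summarize(string = paste0(word, collapse = " ")) -> out
  return(out)
}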

# Convert numbers to words and create new strings.

ww <- myfun(myvec = WWname)
px <- myfun(myvec = PXname)

# Calculate distance.
mymat <- stringdistmatrix(a = ww$string, b = px$string, method = "lcs")

rownames(mymat) <- WWname
colnames(mymat) <- PXname

# Flag every WWname that has at least one exact match (distance 0).
mymat %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column(var = "WWname") %>%
  pivot_longer(cols = -WWname, names_to = "PXname", values_to = "ranking") %>%
  group_by(WWname) %>%
  mutate(check = any(ranking == 0)) -> out

Now we check the matches.

filter(out, check == TRUE) %>%
slice(which.min(ranking))

  WWname                              PXname                              ranking check
  <chr>                               <chr>                                 <dbl> <lgl>
1 Excellence Dark 50% Cocoa 100g      Excellence Dark 50% Cocoa 100g            0 TRUE 
2 Excellence Dark 78% COCOA 100G      Excellence Dark 78% Cocoa 100g            0 TRUE 
3 Excellence Dark 85% Cocoa 100g      Excellence Dark 85% Cocoa 100g            0 TRUE 
4 ZDEL Excellence Dark 85% Cocoa 100g ZDEL Excellence Dark 85% Cocoa 100g       0 TRUE
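
As a side note, here is a minimal sketch of why spelling the numbers out can help. It uses base R's adist (Levenshtein distance) rather than the lcs method above, and hand-written word forms instead of the english package; the strings are made up for illustration. With digits, two different percentages are a single edit apart, while the written-out forms are further apart.

```r
# Illustrative strings (assumed, not taken from the asker's data).
d_digits <- adist("dark 85 percent", "dark 80 percent")
d_words  <- adist("dark eighty five percent", "dark eighty percent")

d_digits[1, 1]  # 1: a single substitution, "5" -> "0"
d_words[1, 1]   # 5: the five characters of " five" have to be deleted
```

The absolute values depend on the distance method, but the relative effect is the point: numeric differences get more weight once they are expressed as words.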

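If we want to inspect the names without an exact match, we can do the following. For each such name it returns the closest candidate (smallest distance). In this example every name has an exact match, so nothing is returned.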

filter(out, check == FALSE) %>%
slice(which.min(ranking)) 
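
A different technique that may also be worth trying, sketched here in base R only: treat the numbers as a hard constraint and apply the fuzzy distance to the rest. The helper num_key and the example vectors are hypothetical and only for illustration; adist (Levenshtein) stands in for whichever distance you prefer.

```r
# Hypothetical helper: join all digit runs of each string into a key such as "85-100".
num_key <- function(x) {
  vapply(regmatches(x, gregexpr("[0-9]+", x)),
         function(d) paste(d, collapse = "-"), character(1))
}

# Made-up example data.
ww <- c("Excellence Dark 85% Cocoa 100g", "Excellence Dark 78% COCOA 100G")
px <- c("Excellence Dark 80% Cocoa 100g", "Excellence Dark 85 % Cocoa 100 g",
        "Excellence Dark 78% Cocoa 100g")

d_raw <- adist(tolower(ww), tolower(px))         # plain Levenshtein distances
d <- d_raw
d[outer(num_key(ww), num_key(px), "!=")] <- Inf  # rule out pairs whose numbers differ

best <- apply(d, 1, which.min)                   # closest remaining candidate per row
dist <- d[cbind(seq_along(ww), best)]
best[!is.finite(dist)] <- NA                     # no candidate shares the same numbers

data.frame(ww = ww, px = px[best], dist = dist)
```

In this toy data the unguarded distance would pair the 85% bar with the 80% one (one substitution), whereas the guarded version picks the 85% candidate that differs only in spacing. Whether an exact-number constraint is appropriate depends on how noisy the numbers in your data are.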