Fuzzy logic on strings in R

Question (votes: 1, answers: 1)

I have two data frames. df1:

ID   Address
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR

and df2:

    Name
    CHHAPPAR
    CHHAPAR
    KANSAL
    KANSIL
    KANSOL
    KHERK
    KHERKIA
    PAR
    UR
    WAR
    RIYA
    DAV
    LI

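I want to apply fuzzy matching to the strings in df1. If a word in a df1 address approximately matches a name in df2, give me the matching name(s) from df2.

The output should look like this: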

ID   Address                                                                 Name
AB1  VILL +PO CHAPAR TAPUKADA  ALWAR                                         CHHAPPAR, CHHAPAR
AB2  VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3  RAMKUMAR YADAV VILL  KANSL   0 JAIPUR                                   KANSAL, KANSIL, KANSOL
AB4  VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR                           KHERK, KHERKIA

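I tried the fuzzywuzzyR package, but it gave an error.

I also tried agrep, but it only gives me TRUE/FALSE results.

Please help me solve this. Also, should I try a different fuzzy-matching package?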

r fuzzy-logic fuzzywuzzy agrep
1 Answer
Votes: 1

I would use the fuzzyjoin package for the matching logic, together with tidytext:

library(tidytext)
library(fuzzyjoin)
library(tidyverse)

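df1 %>% 
  # split each address into one row per word, keeping the upper case
  unnest_tokens(word, Address, to_lower = FALSE) %>% 
  # attach every df2 name within string distance 1 of a word (NA if none)
  fuzzyjoin::stringdist_left_join(df2, by = c("word" = "Name"), max_dist = 1) %>% 
  group_by(ID) %>% # collapse unnested tokens back to text if you want
  # note: a word with several matches appears once per match in `text`
  summarise(text = paste(word, collapse = " "),
            Name = toString(na.omit(Name)))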
#> # A tibble: 4 x 3
#>   ID    text                                                 Name               
#>   <chr> <chr>                                                <chr>              
#> 1 AB1   VILL PO CHAPAR TAPUKADA ALWAR                        "CHHAPAR"          
#> 2 AB2   VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POS~ ""                 
#> 3 AB3   RAMKUMAR YADAV VILL KANSL KANSL KANSL 0 JAIPUR       "KANSAL, KANSIL, K~
#> 4 AB4   VILL KHERKI KHERKI MUKKER POSTPANIYA PUTLI JAIPUR    "KHERK, KHERKIA"

Data

df1 <- read.csv(text = "ID,Address
AB1,VILL +PO CHAPAR TAPUKADA  ALWAR
AB2,VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI
AB3,RAMKUMAR YADAV VILL  KANSL   0 JAIPUR
AB4,VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR", stringsAsFactors = FALSE)

df2 <- read.csv(text = "Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA", stringsAsFactors = FALSE)
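
For comparison, here is a minimal base-R sketch of the same idea without fuzzyjoin or tidytext. It is not part of the original answer. The helper name match_names and its arguments are made up for illustration, and it assumes that plain Levenshtein distance from adist() with a threshold of 1 is an acceptable stand-in for the join above.

```r
# Sample data copied from the question (answer's reduced df2).
df1 <- data.frame(
  ID = c("AB1", "AB2", "AB3", "AB4"),
  Address = c(
    "VILL +PO CHAPAR TAPUKADA  ALWAR",
    "VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA  0 SIROHI",
    "RAMKUMAR YADAV VILL  KANSL   0 JAIPUR",
    "VILL KHERKI MUKKER  POSTPANIYA PUTLI   JAIPUR"
  ),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("CHHAPPAR", "CHHAPAR", "KANSAL", "KANSIL", "KANSOL", "KHERK", "KHERKIA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the candidates that lie within `max_dist`
# edits of at least one word of `address`, as a comma-separated string.
match_names <- function(address, candidates, max_dist = 1) {
  # split on anything that is not a letter or digit, drop empty pieces
  tokens <- strsplit(address, "[^A-Za-z0-9]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) == 0) return("")
  # rows = tokens, columns = candidates, cells = Levenshtein distance
  d <- adist(tokens, candidates)
  # keep a candidate if its closest token is near enough
  toString(candidates[apply(d, 2, min) <= max_dist])
}

df1$Name <- vapply(df1$Address, match_names, character(1),
                   candidates = df2$Name, USE.NAMES = FALSE)
print(df1[, c("ID", "Name")])
```

Raising max_dist to 2 would also pick up CHHAPPAR for AB1, at the cost of more false positives on short words, so the threshold is a judgment call for your data.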