I have 2 data frames. DF1:
ID Address
AB1 VILL +PO CHAPAR TAPUKADA ALWAR
AB2 VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA 0 SIROHI
AB3 RAMKUMAR YADAV VILL KANSL 0 JAIPUR
AB4 VILL KHERKI MUKKER POSTPANIYA PUTLI JAIPUR
and DF2:
Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA
PAR
UR
WAR
RIYA
DAV
LI
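I want to apply fuzzy matching to the address strings in DF1. If a name that appears in a DF1 address matches a name in DF2, give me the DF2 name.
The output should look like this: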
ID Address Name
AB1 VILL +PO CHAPAR TAPUKADA ALWAR CHHAPPAR, CHHAPAR
AB2 VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA 0 SIROHI
AB3 RAMKUMAR YADAV VILL KANSL 0 JAIPUR KANSAL, KANSIL, KANSOL
AB4 VILL KHERKI MUKKER POSTPANIYA PUTLI JAIPUR KHERK, KHERKIA
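I tried the fuzzywuzzyR package, but it gave an error.
I also tried agrep, but it only gives me TRUE/FALSE results.
Please help me solve this. Also, should I try a different fuzzy matching package?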
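I would use the fuzzyjoin package, whose logic fits well with tidytext: split each address into words, then fuzzy-join the words against the names in df2.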
library(tidytext)
library(fuzzyjoin)
library(tidyverse)
df1 %>%
  unnest_tokens(word, Address, to_lower = FALSE) %>%
  fuzzyjoin::stringdist_left_join(df2, by = c("word" = "Name"), max_dist = 1) %>%
  group_by(ID) %>% # collapse unnested tokens back to text if you want
  summarise(text = paste(word, collapse = " "),
            Name = toString(na.omit(Name)))
#> # A tibble: 4 x 3
#>   ID    text                                                  Name
#>   <chr> <chr>                                                 <chr>
#> 1 AB1   VILL PO CHAPAR TAPUKADA ALWAR                         "CHHAPAR"
#> 2 AB2   VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POS~  ""
#> 3 AB3   RAMKUMAR YADAV VILL KANSL KANSL KANSL 0 JAIPUR        "KANSAL, KANSIL, K~
#> 4 AB4   VILL KHERKI KHERKI MUKKER POSTPANIYA PUTLI JAIPUR     "KHERK, KHERKIA"
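Note that AB1 comes back with only CHHAPAR, while the desired output in the question also lists CHHAPPAR. A quick check of the edit distances suggests why. This is only a sketch using base R's utils::adist (plain Levenshtein distance); stringdist_left_join uses the "osa" method by default, and treating the two as interchangeable here is my assumption, which should hold for these strings because no transpositions are involved.

```r
# Edit distance from the address token "CHAPAR" to the two candidate names.
# CHHAPPAR needs two insertions (an H and a P); CHHAPAR needs only one (an H).
d <- drop(utils::adist("CHAPAR", c("CHHAPPAR", "CHHAPAR")))

# With max_dist = 1 only the second name is close enough to match.
within_one <- d <= 1
print(d)
print(within_one)
```

So raising max_dist to 2 should also return CHHAPPAR for AB1, probably at the cost of more false positives on other tokens.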
Data used in the fuzzyjoin example:
df1 <- read.csv(text = "ID,Address
AB1,VILL +PO CHAPAR TAPUKADA ALWAR
AB2,VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA 0 SIROHI
AB3,RAMKUMAR YADAV VILL KANSL 0 JAIPUR
AB4,VILL KHERKI MUKKER POSTPANIYA PUTLI JAIPUR", stringsAsFactors = FALSE)
df2 <- read.csv(text = "Name
CHHAPPAR
CHHAPAR
KANSAL
KANSIL
KANSOL
KHERK
KHERKIA", stringsAsFactors = FALSE)
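If you would rather not add packages, roughly the same idea can be sketched in base R. This is an alternative to the fuzzyjoin approach, not a drop-in replacement: it uses utils::adist (Levenshtein distance) rather than the "osa" distance, and the helper name match_names and its candidates / max_dist arguments are made up for this illustration. It also leaves Address untouched, so no tokens get duplicated.

```r
# Same data as above, rebuilt here so the snippet runs on its own.
df1 <- data.frame(
  ID = c("AB1", "AB2", "AB3", "AB4"),
  Address = c(
    "VILL +PO CHAPAR TAPUKADA ALWAR",
    "VILL WARD NO 02 THIKARIYA CHAND RAWAT JUNA PADA POST BADANA 0 SIROHI",
    "RAMKUMAR YADAV VILL KANSL 0 JAIPUR",
    "VILL KHERKI MUKKER POSTPANIYA PUTLI JAIPUR"
  ),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Name = c("CHHAPPAR", "CHHAPAR", "KANSAL", "KANSIL", "KANSOL", "KHERK", "KHERKIA"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the candidate names that lie within max_dist
# edits of at least one word in the address, as a comma-separated string.
match_names <- function(address, candidates, max_dist = 1) {
  # Split on runs of non-alphanumeric characters and drop empty pieces.
  tokens <- strsplit(address, "[^[:alnum:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  # Distance matrix: one row per token, one column per candidate name.
  d <- utils::adist(tokens, candidates)
  hit <- colSums(d <= max_dist) > 0
  toString(candidates[hit])
}

df1$Name <- vapply(df1$Address, match_names, character(1),
                   candidates = df2$Name, USE.NAMES = FALSE)
print(df1[, c("ID", "Name")])
```

On this data it should give the same Name column as the fuzzyjoin version, with an empty string for AB2.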