How do I filter one data frame by two columns of another data frame, where one column is an exact match and the other is a substring match?


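I have two data frames that both have a `Last_Name` column. The first data frame has a column `Contains_First_Name`, and the second data frame has a column called `First_Name`. I want to join the two on the exact spelling of `Last_Name` as well as a substring match between `Contains_First_Name` and `First_Name` (where `First_Name` is a substring of `Contains_First_Name`). See the example below.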

library(dplyr)
library(stringr)

# Create df1
Last_Name <- c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee")
Contains_First_Name <- c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick", "McAdams Jennifer Marie", "Kirk", "Kirk B")
Account_Number <- c("123", "345", "678", "901", "234", "567", "890")

df1 <- data.frame(Last_Name, Contains_First_Name, Account_Number)

# Create df2
Last_Name <- c("Smith", "Jones", "Adams", "Lee", "Lee")
First_Name <- c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")

df2 <- data.frame(Last_Name, First_Name)

The resulting data frames:

> df1
  Last_Name    Contains_First_Name Account_Number
1     Smith        Kimberly Nicole            123
2     Jones              Patrick L            345
3     Adams            Johnson Ann            678
4    Rogers                   Rick            901
5       Lee McAdams Jennifer Marie            234
6       Lee                   Kirk            567
7       Lee                 Kirk B            890
> df2
  Last_Name First_Name
1     Smith   Kimberly
2     Jones    Patrick
3     Adams        Ann
4       Lee   Jennifer
5       Lee       Kirk

The final result I want is:

> df3
  Last_Name    Contains_First_Name Account_Number First_Name
1     Smith        Kimberly Nicole            123 Kimberly
2     Jones              Patrick L            345 Patrick
3     Adams            Johnson Ann            678 Ann
4       Lee McAdams Jennifer Marie            234 Jennifer
5       Lee                   Kirk            567 Kirk
6       Lee                 Kirk B            890 Kirk

I tried this:

df3 <-
  filter(df1,
         Last_Name %in% df2$Last_Name,
         str_detect(Contains_First_Name, paste(df2$First_Name, collapse = "|")))

It produced the following error:

Error in match.arg(method) : 'arg' must be NULL or a character vector

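I also explored the `fuzzyjoin` library, but could not figure out how to join on two variables with two different join types (exact and substring). I saw a similar question, but it appears to have no answers: merging two data frames based on an exact match in one column and an inexact match in another column in R. Any input is greatly appreciated. Thank you.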

r dataframe filter substring fuzzyjoin
1 Answer

I'd say you have two options: either do an equi-join on the first column only and filter afterwards, or use `fuzzyjoin`, as you described:

# Approach 1: Match all, filter later

inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |> 
  filter(str_detect(Contains_First_Name, First_Name))
#> # A tibble: 6 × 4
#>   Last_Name Contains_First_Name    Account_Number First_Name
#>   <chr>     <chr>                  <chr>          <chr>     
#> 1 Smith     Kimberly Nicole        123            Kimberly  
#> 2 Jones     Patrick L              345            Patrick   
#> 3 Adams     Johnson Ann            678            Ann       
#> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
#> 5 Lee       Kirk                   567            Kirk      
#> 6 Lee       Kirk B                 890            Kirk
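
One caveat with Approach 1: `str_detect()` treats its pattern as a regular expression by default, so a first name containing a metacharacter such as `.` or `(` would not be matched literally. A minimal sketch of a literal-substring variant, wrapping the pattern in stringr's `fixed()`; it assumes dplyr 1.1.0 or later (for `join_by()` and `relationship`) and rebuilds the question's data so that it runs on its own:

```r
library(dplyr)
library(stringr)

# Data from the question, rebuilt so the sketch is self-contained
df1 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee"),
  Contains_First_Name = c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick",
                          "McAdams Jennifer Marie", "Kirk", "Kirk B"),
  Account_Number = c("123", "345", "678", "901", "234", "567", "890")
)
df2 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Lee", "Lee"),
  First_Name = c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")
)

# Exact join on Last_Name, then keep the rows whose Contains_First_Name
# contains First_Name as a literal substring (no regex interpretation)
df3 <- inner_join(df1, df2, join_by(Last_Name), relationship = "many-to-many") |>
  filter(str_detect(Contains_First_Name, fixed(First_Name)))
```

Note that a plain substring test also matches inside longer words (for example, `Ann` inside `Joanna`); whether that is acceptable depends on the data.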

# Approach 2: fuzzyjoin

fuzzyjoin::fuzzy_inner_join(
  df1,
  df2,
  by = c("Last_Name" = "Last_Name", "Contains_First_Name" = "First_Name"),
  match_fun = list(`==`, \(x, y) str_detect(x, y))
) |> 
  select(!Last_Name.y) |> 
  rename(Last_Name = Last_Name.x)
#> # A tibble: 6 × 4
#>   Last_Name Contains_First_Name    Account_Number First_Name
#>   <chr>     <chr>                  <chr>          <chr>     
#> 1 Smith     Kimberly Nicole        123            Kimberly  
#> 2 Jones     Patrick L              345            Patrick   
#> 3 Adams     Johnson Ann            678            Ann       
#> 4 Lee       McAdams Jennifer Marie 234            Jennifer  
#> 5 Lee       Kirk                   567            Kirk      
#> 6 Lee       Kirk B                 890            Kirk
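
If adding packages is not an option, the same "join exactly, then filter" idea can be sketched in base R. This is an illustrative alternative, not part of the original answer: `merge()` performs the exact join on `Last_Name`, and `mapply()` applies `grepl(..., fixed = TRUE)` row by row. The names `candidates` and `keep` are made up for this sketch.

```r
# Data from the question, rebuilt so the sketch is self-contained
df1 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee"),
  Contains_First_Name = c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick",
                          "McAdams Jennifer Marie", "Kirk", "Kirk B"),
  Account_Number = c("123", "345", "678", "901", "234", "567", "890")
)
df2 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Lee", "Lee"),
  First_Name = c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")
)

# Step 1: exact join on Last_Name (all pairings that share a last name)
candidates <- merge(df1, df2, by = "Last_Name")

# Step 2: row-wise literal substring test, First_Name inside Contains_First_Name
keep <- mapply(
  grepl,
  candidates$First_Name,
  candidates$Contains_First_Name,
  MoreArgs = list(fixed = TRUE),
  USE.NAMES = FALSE
)

df3_base <- candidates[keep, ]
```

`merge()` sorts by the join key by default, so the row order may differ from the dplyr result even though the matched pairs are the same.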

Created on 2024-01-06 with reprex v2.0.2
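
As for the error in the question's attempt: `method` is an argument of `stats::filter()`, not of `dplyr::filter()`, so the message suggests that the base function was being called, for example because dplyr was not attached or its `filter()` was masked. This is an assumption, since the question does not show the session. A minimal sketch that calls the dplyr verb explicitly, with the question's data rebuilt so it runs on its own:

```r
library(stringr)

# Data from the question
df1 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Rogers", "Lee", "Lee", "Lee"),
  Contains_First_Name = c("Kimberly Nicole", "Patrick L", "Johnson Ann", "Rick",
                          "McAdams Jennifer Marie", "Kirk", "Kirk B"),
  Account_Number = c("123", "345", "678", "901", "234", "567", "890")
)
df2 <- data.frame(
  Last_Name = c("Smith", "Jones", "Adams", "Lee", "Lee"),
  First_Name = c("Kimberly", "Patrick", "Ann", "Jennifer", "Kirk")
)

# One regex alternation built from every first name in df2
pattern <- paste(df2$First_Name, collapse = "|")

# Namespacing the call avoids dispatching to stats::filter() by accident
df3_attempt <- dplyr::filter(
  df1,
  Last_Name %in% df2$Last_Name,
  str_detect(Contains_First_Name, pattern)
)
```

Even when it runs, this filter tests the two conditions independently: any listed last name combined with any listed first name passes, and no `First_Name` column is added. The join-based approaches above pair the names row by row instead.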
