Merging dataframes on the most similar column values

Question (votes: 0, answers: 1)

I have two dataframes:

df1

  Person         ID  Company  Symbol  Number
0   John  68243Q106      NaN     NaN       4
1   Alex   68243Q10      NaN     NaN       3
2   Faye   33690110      NaN     NaN       5
3   Sean   36901103      NaN     NaN       4
4   Doug  336901103      NaN     NaN       3
5   Mike   84670702      NaN     NaN       1
6   John    8467070      NaN     NaN       6

df2

         Company Symbol         ID
0  1-800 Flowers   FLWS  68243Q106
1     1st Source   SRCE  336901103
2           Berk    BRK   84670702
3         Other1    ZZZ   1609W102
4         Other2    YYY    507K103

I merge them on ID:

import pandas as pd

df3 = pd.merge(df1, df2, on='ID', how='left')
# Drop the empty df1 columns, rename the df2 columns, and reorder
df3 = df3.drop(columns=['Company_x', 'Symbol_x'])
df3 = df3.rename(columns={'Company_y': 'Company', 'Symbol_y': 'Symbol'})
df3 = df3[['Person', 'ID', 'Company', 'Symbol', 'Number']]

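The problem is that the IDs in df1 are sometimes slightly off (a missing first or last character, an O instead of a 0, a fat-fingered 8 instead of a 7, and so on), so the exact merge finds no match for those rows. For example:

  • Row 1: the ID is missing its final 6 (68243Q10 should be 68243Q106)
  • Row 2: the ID is missing its final 3 (33690110 should be 336901103)
  • Row 3: the ID is missing its leading 3 (36901103 should be 336901103)
  • Row 6: the ID is missing its final 2 (8467070 should be 84670702)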
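
Each of the mistyped IDs listed above is a single edit away from the value it was meant to be, which is what makes edit-distance matching a plausible fit here. As a rough illustration only, here is a minimal pure-Python sketch of Levenshtein distance applied to those four pairs; the `levenshtein` helper and the `pairs` list are written for this example and are not part of pandas or of any library mentioned in this question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# (mistyped ID from df1, intended ID from df2), taken from the rows above
pairs = [
    ("68243Q10", "68243Q106"),
    ("33690110", "336901103"),
    ("36901103", "336901103"),
    ("8467070", "84670702"),
]

distances = {typo: levenshtein(typo, intended) for typo, intended in pairs}
for typo, intended in pairs:
    print(f"{typo} -> {intended}: distance {distances[typo]}")
```

A distance of 1 on an 8 or 9 character ID is a small relative difference, so a similarity threshold can separate these typos from genuinely different IDs such as 1609W102.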

I would like the output to look like this:

  Person         ID        Company Symbol  Number
0   John  68243Q106  1-800 Flowers   FLWS       4
1   Alex  68243Q106  1-800 Flowers   FLWS       3
2   Faye  336901103     1st Source   SRCE       5
3   Sean  336901103     1st Source   SRCE       4
4   Doug  336901103     1st Source   SRCE       3
5   Mike   84670702           Berk    BRK       1
6   John   84670702           Berk    BRK       6

What is the best way to do this? It seems too complex for a regular expression, so should I look at fuzzywuzzy?

regex python-3.x pandas fuzzy-search
1 Answer
0 votes
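import fuzzy_pandas as fpd

# Fuzzy-match df1 to df2 on ID by Levenshtein similarity (threshold 0.85).
# The ID, Company, and Symbol values are taken from df2, so the result
# carries the corrected ID rather than the mistyped one from df1.
matches = fpd.fuzzy_merge(df1, df2,
                          on=['ID'],
                          keep_left=['Person', 'Number'],
                          keep_right=['ID', 'Company', 'Symbol'],
                          ignore_case=True,
                          method='levenshtein',
                          threshold=0.85)

# Restore the column order from the question
matches = matches[['Person', 'ID', 'Company', 'Symbol', 'Number']]

Result:
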
  Person         ID        Company Symbol  Number
0   John  68243Q106  1-800 Flowers   FLWS       4
1   Alex  68243Q106  1-800 Flowers   FLWS       3
2   Faye  336901103     1st Source   SRCE       5
3   Sean  336901103     1st Source   SRCE       4
4   Doug  336901103     1st Source   SRCE       3
5   Mike   84670702           Berk    BRK       1
6   John   84670702           Berk    BRK       6
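
If installing an extra package is not an option, a similar correction step can be sketched with the standard library's `difflib.get_close_matches`. This is an alternative under stated assumptions, not what `fuzzy_pandas` does internally: `closest_id` and its `cutoff` parameter are hypothetical names chosen for this example, the ID lists are copied by hand from df1 and df2 above, and the 0.85 cutoff applies to difflib's own similarity ratio, which is not the same measure as Levenshtein similarity.

```python
from difflib import get_close_matches

# IDs copied from the question's df1 and df2
df1_ids = ["68243Q106", "68243Q10", "33690110", "36901103",
           "336901103", "84670702", "8467070"]
df2_ids = ["68243Q106", "336901103", "84670702", "1609W102", "507K103"]


def closest_id(raw_id, valid_ids, cutoff=0.85):
    """Return the most similar valid ID, or None if nothing reaches the cutoff."""
    matches = get_close_matches(raw_id, valid_ids, n=1, cutoff=cutoff)
    return matches[0] if matches else None


corrected = [closest_id(raw_id, df2_ids) for raw_id in df1_ids]
for raw_id, fixed in zip(df1_ids, corrected):
    print(f"{raw_id:>9} -> {fixed}")
```

With pandas, the same helper could be applied as `df1['ID'] = df1['ID'].map(lambda x: closest_id(x, df2['ID'].tolist()))`, after which the original exact merge on ID should line up. Rows whose ID has no candidate above the cutoff map to None and stay unmatched, which is usually safer than forcing a wrong match.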