I have two dataframes:
df1
  Person         ID  Company  Symbol  Number
0   John  68243Q106      NaN     NaN       4
1   Alex   68243Q10      NaN     NaN       3
2   Faye   33690110      NaN     NaN       5
3   Sean   36901103      NaN     NaN       4
4   Doug  336901103      NaN     NaN       3
5   Mike   84670702      NaN     NaN       1
6   John    8467070      NaN     NaN       6
df2
         Company Symbol         ID
0  1-800 Flowers   FLWS  68243Q106
1     1st Source   SRCE  336901103
2           Berk    BRK   84670702
3         Other1    ZZZ   1609W102
4         Other2    YYY    507K103
I merge them on ID:
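import pandas as pd

# Left merge keeps every row of df1, matched or not
df3 = pd.merge(df1, df2, on='ID', how='left')
# Drop, rename, and reorder columns
# (axis is passed via columns=; the positional form drop(..., 1) no longer works in pandas 2.x)
df3 = df3.drop(columns=['Company_x', 'Symbol_x'])
df3 = df3.rename(columns={'Company_y': 'Company', 'Symbol_y': 'Symbol'})
df3 = df3[['Person', 'ID', 'Company', 'Symbol', 'Number']]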
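To see which rows an exact merge misses, one option is the indicator flag of pd.merge. The following is only a minimal sketch: it assumes both ID columns hold strings, it rebuilds the sample data inline, and it leaves out df1's empty Company and Symbol columns for brevity. The names checked and unmatched are made up for illustration.

```python
import pandas as pd

# Sample data from the question (df1 without its empty Company/Symbol columns)
df1 = pd.DataFrame({
    "Person": ["John", "Alex", "Faye", "Sean", "Doug", "Mike", "John"],
    "ID": ["68243Q106", "68243Q10", "33690110", "36901103",
           "336901103", "84670702", "8467070"],
    "Number": [4, 3, 5, 4, 3, 1, 6],
})
df2 = pd.DataFrame({
    "Company": ["1-800 Flowers", "1st Source", "Berk", "Other1", "Other2"],
    "Symbol": ["FLWS", "SRCE", "BRK", "ZZZ", "YYY"],
    "ID": ["68243Q106", "336901103", "84670702", "1609W102", "507K103"],
})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for df1 rows whose ID has no exact counterpart in df2
checked = pd.merge(df1, df2, on="ID", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", ["Person", "ID"]]
print(unmatched)
```

On this sample, the rows flagged as left_only are exactly the ones with damaged IDs.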
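The problem is that sometimes the ID in df1 is off (missing the first or last digit, O instead of 0, a fat-fingered 8 instead of 7, etc.), which causes the merge to fail for those rows. For example:
Row 1: the trailing 6 of the ID is missing (68243Q10 vs 68243Q106)
Row 2: the trailing 3 of the ID is missing (33690110 vs 336901103)
Row 3: the leading 3 of the ID is missing (36901103 vs 336901103)
Row 6: the trailing 2 of the ID is missing (8467070 vs 84670702)
I would like the output to look like this: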
  Person         ID        Company Symbol  Number
0   John  68243Q106  1-800 Flowers   FLWS       4
1   Alex  68243Q106  1-800 Flowers   FLWS       3
2   Faye  336901103     1st Source   SRCE       5
3   Sean  336901103     1st Source   SRCE       4
4   Doug  336901103     1st Source   SRCE       3
5   Mike   84670702           Berk    BRK       1
6   John   84670702           Berk    BRK       6
What is the best way to do this? It seems too complicated for a regex, so should I look at fuzzywuzzy?
import fuzzy_pandas as fpd
matches = fpd.fuzzy_merge(df1, df2,
                          on=['ID'],
                          keep_left=['Person', 'Number'],
                          keep_right=['ID', 'Company', 'Symbol'],
                          ignore_case=True,
                          method='levenshtein',
                          threshold=.85)
matches = matches[['Person', 'ID', 'Company', 'Symbol', 'Number']]
  Person         ID        Company Symbol  Number
0   John  68243Q106  1-800 Flowers   FLWS       4
1   Alex  68243Q106  1-800 Flowers   FLWS       3
2   Faye  336901103     1st Source   SRCE       5
3   Sean  336901103     1st Source   SRCE       4
4   Doug  336901103     1st Source   SRCE       3
5   Mike   84670702           Berk    BRK       1
6   John   84670702           Berk    BRK       6
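If an extra dependency is not an option, a similar correction can probably be sketched with the standard library's difflib. This is only an illustration, not the method used above: the helper name closest_id and the 0.85 cutoff are assumptions, and difflib's similarity ratio is not the Levenshtein distance, so borderline cases may be scored differently.

```python
from difflib import get_close_matches

# IDs from the question: df1 holds the possibly damaged ones, df2 the reference ones
df1_ids = ["68243Q106", "68243Q10", "33690110", "36901103",
           "336901103", "84670702", "8467070"]
df2_ids = ["68243Q106", "336901103", "84670702", "1609W102", "507K103"]


def closest_id(bad_id, valid_ids, cutoff=0.85):
    """Return the most similar valid ID, or None if nothing reaches the cutoff.

    Hypothetical helper; cutoff is difflib's similarity ratio in [0, 1].
    """
    hits = get_close_matches(bad_id, valid_ids, n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Map every df1 ID to its closest df2 ID
corrected = [closest_id(i, df2_ids) for i in df1_ids]
print(corrected)
```

With pandas, the same helper could be applied before the original exact merge, for example `df1['ID'] = df1['ID'].map(lambda x: closest_id(x, df2['ID'].tolist()) or x)`, which keeps the original value when no candidate is close enough. On larger data it is worth reviewing the corrected pairs by hand, since two distinct valid IDs can be within the cutoff of each other.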