I have two DataFrames, df1 with shape (1597, 37) and df2 with shape (27293, 115). Both contain company names, postal codes, and other data. The names do not match exactly.
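The manual steps to merge them would be:
1. Compare the company names in df1 and df2 to find matching names, and remove from df2 the companies that are already in df1.
2. Append what remains of df2 to df1.
3. The result is df1 with the new companies from df2 added.
If the names match but the postal codes differ, we assume it is a different company and keep it.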
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'NAME': ['Company A', 'Company B', 'Company C', 'Company D'],
                    'Postal Code': [9001, 9002, 9003, 9004]})
df2 = pd.DataFrame({'Name': ['this is b', 'some company d', 'c is a company',
                             'COMANY f', 'COMANY x', 'Company z', 'w company'],
                    'CP': [9002, 9006, 9003, 9005, 9001, 9007, 9008],
                    'Some other data': np.random.randn(7)})
df1
        NAME  Postal Code
0  Company A         9001
1  Company B         9002
2  Company C         9003
3  Company D         9004
df2
             Name    CP  Some other data
0       this is b  9002         1.867558
1  some company d  9006        -0.977278
2  c is a company  9003         0.950088
3        COMANY f  9005        -0.151357
4        COMANY x  9001        -0.103219
5       Company z  9007         0.410599
6       w company  9008         0.144044
df1_merged
             NAME  Postal Code  Some other data
0       Company A         9001              NaN
1       Company B         9002         0.400157
2       Company C         9003         0.978738
3       Company D         9004              NaN
4  some company d         9006        -0.977278
5        COMANY f         9005        -0.151357
6        COMANY x         9001        -0.103219
7       Company z         9007         0.410599
8       w company         9008         0.144044
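You can rename the columns of df1 to match those of df2 and then do an outer merge. With no on argument, pd.merge joins on the columns the two frames share (here Name and CP), so only rows whose name and postal code are exactly equal are combined: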
df1 = df1.rename(columns={'NAME': 'Name', 'Postal Code': 'CP'})
df = pd.merge(left=df1, right=df2, how='outer')
print(df)
              Name    CP  Some other data
0        Company A  9001              NaN
1        Company B  9002              NaN
2        Company C  9003              NaN
3        Company D  9004              NaN
4        this is b  9002        -0.881567
5   some company d  9006         0.186404
6   c is a company  9003        -0.331076
7         COMANY f  9005        -1.645201
8         COMANY x  9001        -0.978169
9        Company z  9007         0.860190
10       w company  9008         0.020805
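
The outer merge above only combines rows whose Name and CP are exactly equal, so approximate matches (for example 'this is b' and 'Company B') stay as separate rows. A minimal sketch of the manual steps from the question follows. It assumes a deliberately simple, hypothetical matching rule: two rows describe the same company when their postal codes are equal and every word of the df1 name, ignoring a hypothetical list of generic words (GENERIC_WORDS), also appears in the df2 name. The helpers key_tokens and names_match are illustrative names, not pandas functions, and real data would probably need a proper string-similarity measure in their place. The sketch also assumes the two frames share no column names.

```python
import numpy as np
import pandas as pd

# Sample data from the question (the random values differ on each run).
df1 = pd.DataFrame({'NAME': ['Company A', 'Company B', 'Company C', 'Company D'],
                    'Postal Code': [9001, 9002, 9003, 9004]})
df2 = pd.DataFrame({'Name': ['this is b', 'some company d', 'c is a company',
                             'COMANY f', 'COMANY x', 'Company z', 'w company'],
                    'CP': [9002, 9006, 9003, 9005, 9001, 9007, 9008],
                    'Some other data': np.random.randn(7)})

# Hypothetical list of words that carry no identifying information.
GENERIC_WORDS = {'company', 'comany'}


def key_tokens(name):
    """Lowercase a name and return its words minus the generic ones."""
    return set(name.lower().split()) - GENERIC_WORDS


def names_match(name1, name2):
    """Assumed rule: every key token of name1 must appear in name2."""
    tokens1 = key_tokens(name1)
    return bool(tokens1) and tokens1 <= set(name2.lower().split())


# Step 1: pair rows that share a postal code, then keep pairs whose names match.
left = df1.reset_index().rename(columns={'index': 'idx1'})
right = df2.reset_index().rename(columns={'index': 'idx2'})
pairs = left.merge(right, left_on='Postal Code', right_on='CP')
mask = [names_match(a, b) for a, b in zip(pairs['NAME'], pairs['Name'])]
pairs = pairs.loc[mask]
# Keep at most one df2 match per df1 row, and one df1 match per df2 row.
pairs = pairs.drop_duplicates(subset='idx1').drop_duplicates(subset='idx2')

# Step 2: copy the extra df2 columns onto the matched df1 rows.
extra_cols = [c for c in df2.columns if c not in ('Name', 'CP')]
df1_merged = df1.join(pairs.set_index('idx1')[extra_cols])

# Step 3: append the unmatched df2 rows, renamed to df1's column names.
new_rows = (df2.drop(index=pairs['idx2'])
               .rename(columns={'Name': 'NAME', 'CP': 'Postal Code'}))
df1_merged = pd.concat([df1_merged, new_rows], ignore_index=True)
print(df1_merged)
```

On the sample data this keeps the four df1 rows, fills Some other data for Company B and Company C from 'this is b' and 'c is a company', and appends the five unmatched df2 rows. 'some company d' is kept as a new company because its postal code differs from that of Company D.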