Suppose you have the following two DataFrames, dfs and df1.
dfs is defined as:
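import pandas as pd

dfs = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15]
})
# Sum Accepted per (Year, Provider); the result is indexed by a two-level MultiIndex
dfs = dfs.groupby(['Year', 'Provider']).sum()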
df1 is defined as:
df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985']
})
I want to merge these two DataFrames to get a result like this:
df2 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsdabcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15, 2570, 1020, 2140, 120, 15],
    'Gender': ['m', 'm', 'm', 'Nan', 'Nan', 'f', 'f', 'Nan', 'Nan', 'Nan'],
    'Accepted app': ['990', '435', '985', 'Nan', 'Nan', '1180', '405', 'Nan', 'Nan', 'Nan']
})
I don't know how to keep the MultiIndex of dfs, or how to merge the two.
Use concat to repeat dfs once per gender, then left-merge with df1:
df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left'))
print(df)
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15      m          NaN
4  2006        s       120      m          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140      f          NaN
8  2006        d        15      f          NaN
9  2006        s       120      f          NaN
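To see what each part of that chain contributes, here is the same logic split into named steps. This is only a sketch for illustration: the intermediate names genders, stacked, flat and steps are my own and do not appear in the answer above.

```python
import pandas as pd

dfs = (pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum())

df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

# Step 1: make one copy of dfs per distinct gender, each tagged with that gender.
genders = df1['Gender'].unique()
stacked = pd.concat([dfs.assign(Gender=g) for g in genders])

# Step 2: move the (Year, Provider) index levels back into ordinary columns.
flat = stacked.reset_index()

# Step 3: left merge, so every (Year, Provider, Gender) row from flat is kept
# and 'Accepted app' is filled only where df1 has a matching row.
steps = flat.merge(df1, on=['Provider', 'Year', 'Gender'], how='left')

print(steps.shape)
```

Because the merge is a left merge, the row count is fixed by step 1: five providers times two genders.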
If you also want Gender set to a missing value in the rows that have no match in df1:
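df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        # indicator=True adds a _merge column telling where each row came from
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left', indicator=True)
        # blank out Gender where the row exists only on the left side
        .assign(Gender=lambda x: x['Gender'].mask(x['_merge'].eq('left_only')))
        .drop('_merge', axis=1))
print(df)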
   Year Provider  Accepted Gender Accepted app
0  2006        a      2570      m          990
1  2006        b      1020      m          435
2  2006        c      2140      m          985
3  2006        d        15    NaN          NaN
4  2006        s       120    NaN          NaN
5  2006        a      2570      f         1180
6  2006        b      1020      f          405
7  2006        c      2140    NaN          NaN
8  2006        d        15    NaN          NaN
9  2006        s       120    NaN          NaN
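The question also asks about keeping the MultiIndex. The merge above returns a flat DataFrame, so one possible follow-up is to set the index again afterwards. This is a sketch under my own assumptions: it uses the first variant (Gender not masked, so there are no NaN index labels), it adds Gender as a third level so the index stays unique, and the name indexed is hypothetical.

```python
import pandas as pd

dfs = (pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('abcsd'),
    'Accepted': [2570, 1020, 2140, 120, 15],
}).groupby(['Year', 'Provider']).sum())

df1 = pd.DataFrame({
    'Year': [2006, 2006, 2006, 2006, 2006],
    'Provider': list('aabbc'),
    'Gender': list('mfmfm'),
    'Accepted app': ['990', '1180', '435', '405', '985'],
})

df = (pd.concat([dfs.assign(Gender=c) for c in df1['Gender'].unique()])
        .reset_index()
        .merge(df1, on=['Provider', 'Year', 'Gender'], how='left'))

# Rebuild a MultiIndex from the key columns; sorting keeps label lookups well behaved.
indexed = df.set_index(['Year', 'Provider', 'Gender']).sort_index()

# Select all rows for one (Year, Provider) pair by a partial key.
print(indexed.loc[(2006, 'a')])
```

If you would rather keep exactly the original two levels, set_index(['Year', 'Provider']) also works, but the index will then contain each pair once per gender.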