I have this toy dataset:
import pandas as pd

df = pd.DataFrame({
    'user': [1,1,1,2,2,2,2,3,3,3,3,4,4,4,4,4],
    'd1': ['1995-09-01','1995-09-02','1995-10-03','1995-10-04','1995-10-05','1995-11-07','1995-11-08','1995-11-09','1995-11-10','1995-11-15','1995-12-18','1995-12-19','1995-12-20','1995-12-23','1995-12-26','1995-12-30'],
    'd2': ['1995-10-05','1995-10-05','1995-10-05',
           '1995-11-08','1995-11-08','1995-11-08','1995-11-08',
           '1995-12-10','1995-12-10','1995-12-10','1995-12-10',
           '1995-12-27','1995-12-27','1995-12-27','1995-12-27','1995-12-27'],
})
Sorted by user and d1 (df = df.sort_values(['user', 'd1'])), it looks like this:
user d1 d2
1 1995-09-01 1995-10-05
1 1995-09-02 1995-10-05
1 1995-10-03 1995-10-05
2 1995-10-04 1995-11-08
2 1995-10-05 1995-11-08
2 1995-11-07 1995-11-08
2 1995-11-08 1995-11-08
3 1995-11-09 1995-12-10
3 1995-11-10 1995-12-10
3 1995-11-15 1995-12-10
3 1995-12-18 1995-12-10
4 1995-12-19 1995-12-27
4 1995-12-20 1995-12-27
4 1995-12-23 1995-12-27
4 1995-12-26 1995-12-27
4 1995-12-30 1995-12-27
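I need to generate a new column, d3, that holds the d1 date closest to d2 within each user. For example, if the d2 date also appears in d1, d3 shows that d2 date. Otherwise it shows the nearest d1 date.
Note that the result is computed per user group.
The following dataframe is the desired result: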
user d1 d2 d3
1 1995-09-01 1995-10-05 1995-10-03
1 1995-09-02 1995-10-05 1995-10-03
1 1995-10-03 1995-10-05 1995-10-03
2 1995-10-04 1995-11-08 1995-11-08
2 1995-10-05 1995-11-08 1995-11-08
2 1995-11-07 1995-11-08 1995-11-08
2 1995-11-08 1995-11-08 1995-11-08
3 1995-11-09 1995-12-10 1995-12-18
3 1995-11-10 1995-12-10 1995-12-18
3 1995-11-15 1995-12-10 1995-12-18
3 1995-12-18 1995-12-10 1995-12-18
4 1995-12-19 1995-12-27 1995-12-26
4 1995-12-20 1995-12-27 1995-12-26
4 1995-12-23 1995-12-27 1995-12-26
4 1995-12-26 1995-12-27 1995-12-26
4 1995-12-30 1995-12-27 1995-12-26
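You can compute the absolute difference between the two dates, get the row index of the minimum difference per group, and map the corresponding d1 values back onto each user: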
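# convert both date columns from strings to datetime
df[['d1', 'd2']] = df[['d1', 'd2']].apply(pd.to_datetime)
# row label of the smallest |d2 - d1| within each user
idx = df['d2'].sub(df['d1']).abs().groupby(df['user']).idxmin()
# look up d1 at those rows, re-index by user, and map onto every row
df['d3'] = df['user'].map(df.loc[idx, 'd1'].set_axis(idx.index))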
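To make the intermediate steps easier to follow, here is a minimal self-contained sketch of the same approach on a smaller frame. The two-user data and the names `gap` and `closest` are only illustrative and are not part of the original question. The dates are taken from users 1 and 4 above, with user 4 relabeled as 2.

```python
import pandas as pd

# Small hypothetical frame in the same shape as the question's data.
df = pd.DataFrame({
    'user': [1, 1, 1, 2, 2],
    'd1': ['1995-09-01', '1995-09-02', '1995-10-03', '1995-12-26', '1995-12-30'],
    'd2': ['1995-10-05', '1995-10-05', '1995-10-05', '1995-12-27', '1995-12-27'],
})
df[['d1', 'd2']] = df[['d1', 'd2']].apply(pd.to_datetime)

# Absolute gap between d2 and d1 for every row.
gap = df['d2'].sub(df['d1']).abs()

# Row label of the smallest gap within each user
# (idxmin returns the first occurrence if there is a tie).
idx = gap.groupby(df['user']).idxmin()

# Closest d1 per user, re-indexed by user so it can be mapped back.
closest = df.loc[idx, 'd1'].set_axis(idx.index)
df['d3'] = df['user'].map(closest)

print(df)
```

Because `idxmin` returns row labels rather than positions, this works with `df.loc` even if the frame has been re-sorted, provided the index labels are unique.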