I have the following data frames:
import pandas as pd
foo1 = pd.DataFrame({'id': [1, 1, 2, 2],
                     'phase': ['Pre', 'Post', 'Pre', 'Post'],
                     'date_start': ['2022-07-24', '2022-12-25', '2022-09-30', '2022-12-25'],
                     'date_end': ['2022-07-30', '2023-03-07', '2022-10-05', '2023-03-04']})
foo2 = pd.DataFrame({'id': [1, 1, 1, 1,
                            2, 2, 2, 2],
                     'date': ['2022-07-24', '2022-07-25', '2022-12-26', '2023-01-01',
                              '2022-10-04', '2022-11-25', '2022-12-26', '2023-03-01']})
print(foo1, '\n', foo2)
   id phase  date_start    date_end
0   1   Pre  2022-07-24  2022-07-30
1   1  Post  2022-12-25  2023-03-07
2   2   Pre  2022-09-30  2022-10-05
3   2  Post  2022-12-25  2023-03-04
   id        date
0   1  2022-07-24
1   1  2022-07-25
2   1  2022-12-26
3   1  2023-01-01
4   2  2022-10-04
5   2  2022-11-25
6   2  2022-12-26
7   2  2023-03-01
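I want to add the phase column to foo2 by merging on id and on whether date falls between date_start and date_end. If date is not within [date_start, date_end], the phase column should be NaN.
The resulting data frame should look like this: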
   id        date phase
0   1  2022-07-24   Pre
1   1  2022-07-25   Pre
2   1  2022-12-26  Post
3   1  2023-01-01  Post
4   2  2022-10-04   Pre
5   2  2022-11-25   NaN
6   2  2022-12-26  Post
7   2  2023-03-01  Post
How can I do that?
I found this, but it does not cover the "merge on id" part.
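Use an inner join on id, then keep only the rows where date lies between date_start and date_end by comparing the columns with Series.between inside DataFrame.loc. Finally, use a left join against foo2 to add back the (id, date) pairs that had no match: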
df1 = foo2.merge(foo1, on='id')
df2 = df1.loc[df1['date'].between(df1['date_start'],df1['date_end']), ['id','date','phase']]
df = foo2.merge(df2, how='left')
print (df)
   id        date phase
0   1  2022-07-24   Pre
1   1  2022-07-25   Pre
2   1  2022-12-26  Post
3   1  2023-01-01  Post
4   2  2022-10-04   Pre
5   2  2022-11-25   NaN
6   2  2022-12-26  Post
7   2  2023-03-01  Post
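The snippet above compares the dates as strings, which works here only because every value is in ISO YYYY-MM-DD form. As a minimal sketch, assuming you would rather compare real datetimes, the same three steps can be run after converting the columns with pd.to_datetime (the in_range variable name is just for illustration):

```python
import pandas as pd

foo1 = pd.DataFrame({'id': [1, 1, 2, 2],
                     'phase': ['Pre', 'Post', 'Pre', 'Post'],
                     'date_start': ['2022-07-24', '2022-12-25', '2022-09-30', '2022-12-25'],
                     'date_end': ['2022-07-30', '2023-03-07', '2022-10-05', '2023-03-04']})
foo2 = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 2],
                     'date': ['2022-07-24', '2022-07-25', '2022-12-26', '2023-01-01',
                              '2022-10-04', '2022-11-25', '2022-12-26', '2023-03-01']})

# Assumption: parse to datetime64 so the range check does not rely on string ordering
foo1[['date_start', 'date_end']] = foo1[['date_start', 'date_end']].apply(pd.to_datetime)
foo2['date'] = pd.to_datetime(foo2['date'])

# 1) inner join on id: every date is paired with every phase of the same id
df1 = foo2.merge(foo1, on='id')
# 2) keep only the pairs whose date lies inside [date_start, date_end] (inclusive)
in_range = df1['date'].between(df1['date_start'], df1['date_end'])
df2 = df1.loc[in_range, ['id', 'date', 'phase']]
# 3) left join back to foo2 so unmatched (id, date) pairs get NaN in phase
df = foo2.merge(df2, how='left')
print(df)
```

Note that the inner join builds one row per (date, phase) combination within each id, so the intermediate df1 can get large when an id has many phases and many dates.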
If a single (id, date) pair can match multiple rows in df2, keep only the first match:
df = foo2.merge(df2.drop_duplicates(['id','date']), how='left')
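To see why drop_duplicates matters, here is a small sketch with made-up data (the ranges and events frames are hypothetical, not from the question) in which two phases of the same id overlap. Without deduplication the left join returns an extra row for the date that falls into both ranges:

```python
import pandas as pd

# Hypothetical data: the two ranges for id 1 overlap from 2022-07-28 to 2022-07-30
ranges = pd.DataFrame({'id': [1, 1],
                       'phase': ['Pre', 'Post'],
                       'date_start': ['2022-07-24', '2022-07-28'],
                       'date_end': ['2022-07-30', '2022-08-05']})
events = pd.DataFrame({'id': [1, 1], 'date': ['2022-07-25', '2022-07-29']})

m = events.merge(ranges, on='id')
matches = m.loc[m['date'].between(m['date_start'], m['date_end']),
                ['id', 'date', 'phase']]

# 2022-07-29 matches both phases, so the plain left join duplicates that row
with_dups = events.merge(matches, how='left')

# Keeping the first match per (id, date) restores one row per event
first_only = events.merge(matches.drop_duplicates(['id', 'date']), how='left')

print(len(with_dups), len(first_only))
```

Which match counts as "first" depends on the row order of the range table, so sort it beforehand if a particular phase should win.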