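I have two dataframes. One contains timestamps and projects. The other contains date ranges, projects and a maturity, which has to be mapped to the corresponding project within the date range.
My problem is similar to this question, but the answer provided there is very slow and I have additional conditions to meet. First of all, my two dataframes look like this: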
In:
import pandas as pd
df_a = pd.DataFrame({'time': ('06.05.2015 16:15:16', '22.06.2015 08:52:05', '28.05.2015 18:20:21','28.06.2015 16:19:21'),
                     'project': ('project1', 'project2', 'project2', 'project1')})
df_b = pd.DataFrame({'start-date': ('02.05.2015 00:00:00', '26.06.2015 00:00:00', '16.05.2015 00:00:00', '30.05.2015 00:00:00'),
                     'end-date':('24.06.2015 00:00:00', '27.07.2015 00:00:00', '27.05.2015 00:00:00', '27.06.2015 00:00:00'),
                     'project': ('project1','project1','project2','project2'),
                     'maturity': ('one','two', 'one','two')})
My code looks like this:
for i in df_a.project.unique():
    for j in df_b.project.unique():
        if i == j:
            for index_df_a, row_df_a in df_a.iterrows():
                for index_df_b, row_df_b in df_b.iterrows():
                    if (row_df_a['time'] >= row_df_b['start-date']) & (row_df_a['time'] <= row_df_b['end-date']):
                        df_a.loc[index_df_a, 'maturity'] = row_df_b.loc['maturity']
                        break
Out:
                  time   project maturity
0  06.05.2015 16:15:16  project1      one
1  22.06.2015 08:52:05  project2      one
2  28.05.2015 18:20:21  project2      NaN
3  28.06.2015 16:19:21  project1      NaN
Expected result:
                  time   project maturity
0  06.05.2015 16:15:16  project1      one
1  22.06.2015 08:52:05  project2      one
2  28.05.2015 18:20:21  project2      two
3  28.06.2015 16:19:21  project1      two
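The if i == j: statement is wrong. This can be seen in row 4 of the result: even though the row is mapped to project1 and the timestamp 28.06.2015 16:19:21 lies within the range start: 26.06.2015 00:00:00 | end: 27.07.2015 00:00:00, its maturity is still NaN instead of two.
If a timestamp such as 28.05.2015 18:20:21 is not within any date range, the next date range should provide the maturity. In this case, two.
Sorry for asking so much at once. I know the best approach is to ask simple questions and build up the result step by step, but I am not experienced enough to break the problem down into smaller parts.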
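One likely contributor to the NaN in row 4, separate from the if i == j: check, is that both frames still hold the dates as strings, so >= and <= compare them character by character rather than chronologically. A minimal check, using only values from the frames above (the names fmt, t and end are just for this illustration):

```python
import pandas as pd

fmt = '%d.%m.%Y %H:%M:%S'
t = '28.06.2015 16:19:21'      # timestamp from df_a, row 4
end = '27.07.2015 00:00:00'    # end-date of the second project1 range

# As strings, '28...' sorts after '27...', so the range test fails.
as_strings = t <= end

# Parsed as datetimes, 28 June is before 27 July, so the test passes.
as_datetimes = pd.to_datetime(t, format=fmt) <= pd.to_datetime(end, format=fmt)

print(as_strings, as_datetimes)
```

Parsing the columns with pd.to_datetime before comparing avoids this, whichever lookup method is used afterwards.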
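Using for-loops on dataframes makes me cringe:
- pd.date_range is used with start-date and end-date to add a d_range column to df_b; .isin can then be used to look up time from df_a inside d_range.
- d_range will be the list of dates between start and end.
- If time is not in the right format, it will not match the dates in d_range, and time will not be found in d_range. This is why the datetime columns are cleaned up first.
import pandas as pd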
# create dataframes from your test set and clean-up the datetime columns
df_a['time'] = (pd.to_datetime(df_a['time'], format='%d.%m.%Y %H:%M:%S')).dt.date
df_b['start-date'] = pd.to_datetime(df_b['start-date'], format='%d.%m.%Y %H:%M:%S').dt.date
df_b['end-date'] = pd.to_datetime(df_b['end-date'], format='%d.%m.%Y %H:%M:%S').dt.date
# df_a view
time project
2015-05-06 project1
2015-06-22 project2
2015-05-28 project2
2015-06-28 project1
# df_b view
start-date end-date project maturity
2015-05-02 2015-06-24 project1 one
2015-06-26 2015-07-27 project1 two
2015-05-16 2015-05-27 project2 one
2015-05-30 2015-06-27 project2 two
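# add d_range to df_b: one range of daily dates per row
# (columns are accessed by label; positional x[0] / x[1] on a labelled row is deprecated in recent pandas)
df_b['d_range'] = df_b[['start-date', 'end-date']].apply(lambda x: pd.date_range(x['start-date'], x['end-date']), axis=1)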
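To illustrate why the clean-up of time matters, here is a small sketch of how pd.date_range and .isin interact. It uses Timestamp.normalize() to drop the time of day, which is a swapped-in alternative to the .dt.date conversion used above; the dates and the names d_range and stamp are chosen only for the example.

```python
import pandas as pd

# Daily range: 2015-05-02, 05-03, 05-04, 05-05, each at midnight.
d_range = pd.date_range('2015-05-02', '2015-05-05')

stamp = pd.Timestamp('2015-05-03 16:15:16')

# With the time of day still attached, no element of the range is equal to it.
found_raw = d_range.isin([stamp]).any()

# Reduced to midnight of the same day, it matches one element of the range.
found_day = d_range.isin([stamp.normalize()]).any()

print(len(d_range), found_raw, found_day)
```

The membership test is an exact equality check against each date in the range, so the lookup only works once both sides are at day resolution.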
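Add maturity to df_a:
- mask is the result of searching d_range in df_b for the date from df_a
- mask matches the date for any project
- the return value is only the result for the matching project
def date_query(x):
    # rows of df_b whose d_range contains the date, regardless of project
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x['time']]).any())].reset_index(drop=True)
    # keep only the maturity for the project of this row
    result = mask['maturity'][mask['project'] == x['project']].reset_index(drop=True)
    # return a scalar so that apply builds a plain column; NaN when nothing matched
    return result.iloc[0] if not result.empty else float('nan')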
# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)
# df_a updated
time project maturity
2015-05-06 project1 one
2015-06-22 project2 two
2015-05-28 project2 NaN
2015-06-28 project1 two
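As a sketch of the same range lookup without a row-wise apply, the frames can be merged on project and then filtered to the rows whose day falls inside the range. This is an alternative formulation, not the method used above. It assumes the columns are parsed as datetimes (not reduced to date objects) and that ranges within a project do not overlap; the names pairs, in_range and matched are hypothetical.

```python
import pandas as pd

fmt = '%d.%m.%Y %H:%M:%S'
df_a = pd.DataFrame({
    'time': pd.to_datetime(['06.05.2015 16:15:16', '22.06.2015 08:52:05',
                            '28.05.2015 18:20:21', '28.06.2015 16:19:21'], format=fmt),
    'project': ['project1', 'project2', 'project2', 'project1'],
})
df_b = pd.DataFrame({
    'start-date': pd.to_datetime(['02.05.2015 00:00:00', '26.06.2015 00:00:00',
                                  '16.05.2015 00:00:00', '30.05.2015 00:00:00'], format=fmt),
    'end-date': pd.to_datetime(['24.06.2015 00:00:00', '27.07.2015 00:00:00',
                                '27.05.2015 00:00:00', '27.06.2015 00:00:00'], format=fmt),
    'project': ['project1', 'project1', 'project2', 'project2'],
    'maturity': ['one', 'two', 'one', 'two'],
})

# Pair every timestamp with every range of the same project,
# keeping the original df_a index in a column named 'index'.
pairs = df_a.reset_index().merge(df_b, on='project', how='left')

# Compare at day resolution, with both range ends inclusive.
in_range = pairs['time'].dt.normalize().between(pairs['start-date'], pairs['end-date'])

# One maturity per original row; rows without a matching range are absent here.
matched = pairs.loc[in_range].drop_duplicates('index').set_index('index')['maturity']

# Assignment aligns on the index, so unmatched rows become NaN.
df_a['maturity'] = matched
print(df_a)
```

The merge creates one row per (timestamp, range) pair within a project, so memory grows with the number of ranges per project; for a handful of ranges per project that is usually acceptable.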
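Handling timestamps that are not in any date range:
- result in def date_query is a pandas.Series; if there is no matching date range it will be empty, which can be checked with .empty
- Update def date_query to check whether result is empty. If result is empty, call def check_min_timedelta.
- .idxmin will return the first occurrence of the minimum value
def check_min_timedelta(x):
    """
    Create a timedelta between time and end-date
    Return maturity for the row with the minimum time delta
    """
    # only the end dates of the same project are considered
    end_diff = abs(df_b['end-date'][df_b['project'] == x['project']] - x['time']).idxmin()
    return df_b['maturity'].loc[end_diff]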
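# update def date_query
def date_query(x):
    # rows of df_b whose d_range contains the date, regardless of project
    mask = df_b[['project', 'maturity']][df_b['d_range'].apply(lambda y: y.isin([x['time']]).any())].reset_index(drop=True)
    # keep only the maturity for the project of this row
    result = mask['maturity'][mask['project'] == x['project']].reset_index(drop=True)
    if result.empty:
        # no range contains the date: fall back to the closest end-date
        return check_min_timedelta(x)
    # return a scalar in both branches so that apply builds a plain column
    return result.iloc[0]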
# call function
df_a['maturity'] = df_a.apply(lambda x: date_query(x), axis=1)
# final df_a:
time project maturity
2015-05-06 project1 one
2015-06-22 project2 two
2015-05-28 project2 one
2015-06-28 project1 two
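Note that the fallback above picks the range with the closest end-date. If the rule from the question is read literally ("the next date range provides the maturity"), one possible sketch uses pd.merge_asof with direction='forward' on end-date: for each timestamp it selects the first range of the same project that ends on or after that day, which is either the range containing it or the next one. This assumes non-overlapping ranges per project and datetime-typed columns; the names left, right and merged are hypothetical.

```python
import pandas as pd

fmt = '%d.%m.%Y %H:%M:%S'
df_a = pd.DataFrame({
    'time': pd.to_datetime(['06.05.2015 16:15:16', '22.06.2015 08:52:05',
                            '28.05.2015 18:20:21', '28.06.2015 16:19:21'], format=fmt),
    'project': ['project1', 'project2', 'project2', 'project1'],
})
df_b = pd.DataFrame({
    'start-date': pd.to_datetime(['02.05.2015 00:00:00', '26.06.2015 00:00:00',
                                  '16.05.2015 00:00:00', '30.05.2015 00:00:00'], format=fmt),
    'end-date': pd.to_datetime(['24.06.2015 00:00:00', '27.07.2015 00:00:00',
                                '27.05.2015 00:00:00', '27.06.2015 00:00:00'], format=fmt),
    'project': ['project1', 'project1', 'project2', 'project2'],
    'maturity': ['one', 'two', 'one', 'two'],
})

# merge_asof needs both sides sorted by their key; keep the original
# index in a column so the result can be aligned back to df_a.
left = (df_a.assign(day=df_a['time'].dt.normalize())
            .reset_index()
            .sort_values('day'))
right = df_b.sort_values('end-date')[['end-date', 'project', 'maturity']]

# For each day, take the first range of the same project ending on or after it.
merged = pd.merge_asof(left, right, left_on='day', right_on='end-date',
                       by='project', direction='forward')

# Align back to the original row order via the saved index.
df_a['maturity'] = merged.set_index('index')['maturity']
print(df_a)
```

Under this reading, 28.05.2015 for project2 maps to two (the next range) rather than one (the closest end-date), and a timestamp after the last range of its project is left as NaN.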