I have the following minimal reproducible example:
import pandas as pd
have = {'id': [1,1,1],
'start_date': ['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2015-06-01 00:00:00'],
'end_date': ['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2015-08-02 23:59:59'],
'attr_1': ['Z', 'Z', 'Z'],
'attr_2': ['A', 'A', ''],
'attr_3': ['B', 'B', '']}
have = pd.DataFrame(data=have)
print(have)
   id           start_date             end_date attr_1 attr_2 attr_3
0   1  2014-12-01 00:00:00  2015-02-28 23:59:59      Z      A      B
1   1  2015-03-01 00:00:00  2015-05-31 23:59:59      Z      A      B
2   1  2015-06-01 00:00:00  2015-08-02 23:59:59      Z
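I want to merge the records within each id when both of the following hold:
start_date equals the previous record's end_date + 1s (i.e. the intervals are contiguous), and
all of the attr_ columns are identical.
The expected result is: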
want = {'id': [1,1],
'start_date': ['2014-12-01 00:00:00', '2015-06-01 00:00:00'],
'end_date': ['2015-05-31 23:59:59', '2015-08-02 23:59:59'],
'attr_1': ['Z', 'Z'],
'attr_2': ['A', ''],
'attr_3': ['B', '']}
want = pd.DataFrame(data=want)
print(want)
   id           start_date             end_date attr_1 attr_2 attr_3
0   1  2014-12-01 00:00:00  2015-05-31 23:59:59      Z      A      B
1   1  2015-06-01 00:00:00  2015-08-02 23:59:59      Z
Note that in my real data there are millions of records, thousands of ids, and more than 60 attr_ columns to check.
What I tried:
# handle datetime properly
have[['start_date','end_date']] = have[['start_date','end_date']].apply(pd.to_datetime)
# shift start dates up by a row
shifted = have.groupby(['id', 'attr_1','attr_2','attr_3'])['start_date'].shift(-1).dt.normalize()
# compare the shifted dates with the end date
connected = shifted == have['end_date'].dt.normalize() + pd.Timedelta('1D')
blocks = connected.groupby([
have['id'], have['attr_1'], have['attr_2'], have['attr_3']
]).cumsum()
blocks
have.groupby([blocks,'id', 'attr_1','attr_2','attr_3'], sort=False).agg({
'start_date':'first',
'end_date':'last'
})
Output:
                          start_date            end_date
  id attr_1 attr_2 attr_3
1 1  Z      A      B      2014-12-01 2015-05-31 23:59:59
0 1  Z                    2015-06-01 2015-08-02 23:59:59
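The attempt above reproduces the expected rows for this small example, but it compares normalized dates with a one-day offset rather than end_date + 1s, and it accumulates a "connected" flag, which would likely split a run of three or more contiguous rows. One possible direction, offered only as a minimal sketch: flag the rows that start a new block (the id changes, the interval is not contiguous with the previous row, or any attr_ column differs) and take the cumulative sum of that flag as a block label. The function name merge_contiguous and its parameters (attr_prefix, gap) are hypothetical names chosen for illustration, and the sketch assumes that every column to compare starts with attr_ and that rows can be sorted by id and start_date.

```python
import pandas as pd


def merge_contiguous(df, id_col="id", start_col="start_date", end_col="end_date",
                     attr_prefix="attr_", gap=pd.Timedelta(seconds=1)):
    """Collapse rows of one id whose intervals touch and whose attributes match.

    Hypothetical helper: the name and parameters are illustrative assumptions.
    """
    # Collect the attribute columns by prefix so 60+ columns need no manual listing.
    attr_cols = [c for c in df.columns if c.startswith(attr_prefix)]

    out = df.copy()
    out[[start_col, end_col]] = out[[start_col, end_col]].apply(pd.to_datetime)
    out = out.sort_values([id_col, start_col]).reset_index(drop=True)

    # Compare every row with the row directly above it.
    prev = out.shift(1)
    same_id = out[id_col].eq(prev[id_col])
    contiguous = out[start_col].eq(prev[end_col] + gap)
    # Note: missing values (NaN) never compare equal, so such rows are not merged.
    same_attrs = out[attr_cols].eq(prev[attr_cols]).all(axis=1)

    # A row opens a new block unless all three conditions hold.
    new_block = ~(same_id & contiguous & same_attrs)
    block = new_block.cumsum()

    agg = {id_col: "first", start_col: "first", end_col: "last"}
    agg.update({c: "first" for c in attr_cols})
    return out.groupby(block, sort=False).agg(agg).reset_index(drop=True)


have = pd.DataFrame({
    "id": [1, 1, 1],
    "start_date": ["2014-12-01 00:00:00", "2015-03-01 00:00:00", "2015-06-01 00:00:00"],
    "end_date": ["2015-02-28 23:59:59", "2015-05-31 23:59:59", "2015-08-02 23:59:59"],
    "attr_1": ["Z", "Z", "Z"],
    "attr_2": ["A", "A", ""],
    "attr_3": ["B", "B", ""],
})
result = merge_contiguous(have)
print(result)
```

On the example data this should give the two rows of want, with the dates parsed as datetimes. Everything is computed with a single shift and one groupby, so no per-id Python loop is needed, though I have not measured it on millions of rows.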