Merge multiple records into a single record if all columns are identical and the dates are contiguous


I have the following minimal reproducible example:

import pandas as pd

have = {'id': [1,1,1], 
     'start_date': ['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2015-06-01 00:00:00'], 
     'end_date': ['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2015-08-02 23:59:59'],
     'attr_1': ['Z', 'Z', 'Z'],
     'attr_2': ['A', 'A', ''],
     'attr_3': ['B', 'B', '']}
have = pd.DataFrame(data=have)
print(have)
   id           start_date             end_date attr_1 attr_2 attr_3
0   1  2014-12-01 00:00:00  2015-02-28 23:59:59      Z      A      B
1   1  2015-03-01 00:00:00  2015-05-31 23:59:59      Z      A      B
2   1  2015-06-01 00:00:00  2015-08-02 23:59:59      Z              

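I want to merge the records for each id if:

  1. the next record's start_date equals the current record's end_date + 1s (i.e. the intervals are contiguous), and
  2. all attr_ columns are identical.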
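The two conditions above can be written as boolean masks over each row and its predecessor. This is only a minimal sketch on the example data: it assumes the rows are already sorted by id and start_date, and the names prev, contiguous, same_attrs and mergeable are illustrative, not part of the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1],
    'start_date': pd.to_datetime(['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2015-06-01 00:00:00']),
    'end_date': pd.to_datetime(['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2015-08-02 23:59:59']),
    'attr_1': ['Z', 'Z', 'Z'],
    'attr_2': ['A', 'A', ''],
    'attr_3': ['B', 'B', ''],
})
attr_cols = ['attr_1', 'attr_2', 'attr_3']

# Previous row's values within the same id (NaT/NaN for the first row of each id).
# Assumption: rows are sorted by id and start_date.
prev = df.groupby('id')[['end_date'] + attr_cols].shift()

# Condition 1: this row starts exactly one second after the previous row ends.
contiguous = df['start_date'].eq(prev['end_date'] + pd.Timedelta(seconds=1))

# Condition 2: every attr_ column matches the previous row.
same_attrs = df[attr_cols].eq(prev[attr_cols]).all(axis=1)

# A row can be folded into the previous one only if both conditions hold.
mergeable = contiguous & same_attrs
print(mergeable.tolist())  # [False, True, False]
```

Here the third row is contiguous with the second but differs in attr_2 and attr_3, so it is not mergeable.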

The expected result is as follows:

want = {'id': [1,1], 
     'start_date': ['2014-12-01 00:00:00', '2015-06-01 00:00:00'], 
     'end_date': ['2015-05-31 23:59:59', '2015-08-02 23:59:59'],
     'attr_1': ['Z', 'Z'],
     'attr_2': ['A', ''],
     'attr_3': ['B', '']}
want = pd.DataFrame(data=want)
print(want)
   id           start_date             end_date attr_1 attr_2 attr_3
0   1  2014-12-01 00:00:00  2015-05-31 23:59:59      Z      A      B
1   1  2015-06-01 00:00:00  2015-08-02 23:59:59      Z              

Note that in my case there are millions of records, thousands of ids, and more than 60 columns to check.
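With 60+ columns, one option is to avoid listing them by hand and to collapse each distinct combination of attribute values into a single integer, so that later comparisons touch one column instead of sixty. The sketch below is one possible way to do that, not a measured optimization: add_attr_key and attr_key are hypothetical names, the attr_ prefix is assumed from the example, and the dropna argument assumes pandas 1.1 or later.

```python
import pandas as pd

def add_attr_key(df, prefix='attr_'):
    """Return a copy of df with an integer column identifying each
    distinct combination of values in the columns starting with prefix."""
    attr_cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    # ngroup() assigns one integer per distinct combination of values;
    # dropna=False keeps combinations that contain missing values.
    out['attr_key'] = out.groupby(attr_cols, sort=False, dropna=False).ngroup()
    return out

demo = pd.DataFrame({
    'id': [1, 1, 1],
    'attr_1': ['Z', 'Z', 'Z'],
    'attr_2': ['A', 'A', ''],
    'attr_3': ['B', 'B', ''],
})
keyed = add_attr_key(demo)
print(keyed[['id', 'attr_key']])
```

Two rows then have identical attributes exactly when their attr_key values are equal. Whether this is faster than comparing all columns directly depends on the data, so it is worth timing on a sample first.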

python pandas date datetime intervals
1 Answer

Try:

# parse the date columns as datetimes
have[['start_date','end_date']] = have[['start_date','end_date']].apply(pd.to_datetime)

# next start date within each (id, attrs) group, truncated to midnight
shifted = have.groupby(['id', 'attr_1','attr_2','attr_3'])['start_date'].shift(-1).dt.normalize()

# a row is connected to the next one if that start date is the day after its own end date
connected = shifted == have['end_date'].dt.normalize() + pd.Timedelta('1D')

# running count of connections within each group, used as the block label
blocks = connected.groupby([
    have['id'], have['attr_1'], have['attr_2'], have['attr_3']
]).cumsum()

# first start date and last end date of each block
have.groupby([blocks,'id', 'attr_1','attr_2','attr_3'], sort=False).agg({
    'start_date':'first',
    'end_date':'last'
})

Output:

                          start_date            end_date
  id attr_1 attr_2 attr_3                               
1 1  Z      A      B      2014-12-01 2015-05-31 23:59:59
0 1  Z                    2015-06-01 2015-08-02 23:59:59
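
The snippet above compares dates at day granularity and uses a running count of "connected to the next row" flags as the block label. That reproduces the expected result for this three-row example, but as far as I can tell a chain of three or more contiguous rows would get different labels, and two rows with the same attributes separated by a gap would both get label 0. A common alternative is to flag the rows where a new block starts and cumulatively sum that flag. The following is a minimal sketch of that idea using the one-second rule from the question. merge_contiguous and its gap parameter are hypothetical names, and it assumes missing attributes are stored as empty strings as in the example (NaN values never compare equal, so such rows would not be merged).

```python
import pandas as pd

def merge_contiguous(df, attr_cols, gap=pd.Timedelta(seconds=1)):
    """Merge consecutive rows of the same id whose intervals touch
    (start == previous end + gap) and whose attr_cols are all equal."""
    df = df.sort_values(['id', 'start_date']).reset_index(drop=True)

    # Previous row's end date and attributes within the same id.
    prev = df.groupby('id')[['end_date'] + attr_cols].shift()

    contiguous = df['start_date'].eq(prev['end_date'] + gap)
    same_attrs = df[attr_cols].eq(prev[attr_cols]).all(axis=1)

    # A row opens a new block unless it continues the previous row.
    # The first row of every id always opens one, so the cumulative
    # sum yields block numbers that are unique across ids.
    block = (~(contiguous & same_attrs)).cumsum()

    agg = {'id': 'first', 'start_date': 'first', 'end_date': 'last'}
    agg.update({c: 'first' for c in attr_cols})
    return df.groupby(block, sort=False).agg(agg).reset_index(drop=True)

have = pd.DataFrame({
    'id': [1, 1, 1],
    'start_date': ['2014-12-01 00:00:00', '2015-03-01 00:00:00', '2015-06-01 00:00:00'],
    'end_date': ['2015-02-28 23:59:59', '2015-05-31 23:59:59', '2015-08-02 23:59:59'],
    'attr_1': ['Z', 'Z', 'Z'],
    'attr_2': ['A', 'A', ''],
    'attr_3': ['B', 'B', ''],
})
have[['start_date', 'end_date']] = have[['start_date', 'end_date']].apply(pd.to_datetime)

want = merge_contiguous(have, ['attr_1', 'attr_2', 'attr_3'])
print(want)
```

Everything here is vectorized except the final groupby, so it should scale to millions of rows, though I have not benchmarked it on data of that size.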