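Create a sequence number in a new column, grouped by tag_id and sub_id, with the DataFrame sorted in ascending order by tag_id and log_date. The expected output is shown below.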
import pandas as pd

dfx = pd.DataFrame({'tag_id': [496976, 496976, 603100, 603100, 715078, 715078, 978924, 978924, 1276120, 1276120, 1276120],
                    'sub_id': ['W00-8748', 'A00-1012', 'A00-4028', 'A00-4028', 'W00-0706',
                               'W00-1960', 'W00-1066', 'W00-6462', 'W00-1397', 'W00-1427', 'W00-6727'],
                    'log_date': ['2012-06-28', '2013-06-28', '2016-02-18', '2016-02-18', '2008-02-25', '2008-09-28', '2008-04-28', '2010-10-20', '2008-07-10', '2008-07-15', '2008-07-15']})
# Sort by tag_id, then log_date, and rebuild a clean 0..n-1 index
dfx = dfx.sort_values(by=['tag_id', 'log_date'], ascending=True).reset_index(drop=True)
dfx
Expected output
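Use groupby.cumcount and subtract the groupby.cumsum of the duplicate-row flags. cumcount numbers the rows within each tag_id, and subtracting the running count of duplicated rows keeps a repeated row from advancing the sequence: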
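# d is True for rows that exactly repeat an earlier row (all columns equal)
g = dfx.assign(d=dfx.duplicated()).groupby('tag_id')
# 1-based position within each tag_id, minus the duplicates seen so far
dfx['Seq'] = g.cumcount().add(1) - g['d'].cumsum()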
Output:
     tag_id    sub_id    log_date  Seq
0    496976  W00-8748  2012-06-28    1
1    496976  A00-1012  2013-06-28    2
2    603100  A00-4028  2016-02-18    1
3    603100  A00-4028  2016-02-18    1
4    715078  W00-0706  2008-02-25    1
5    715078  W00-1960  2008-09-28    2
6    978924  W00-1066  2008-04-28    1
7    978924  W00-6462  2010-10-20    2
8   1276120  W00-1397  2008-07-10    1
9   1276120  W00-1427  2008-07-15    2
10  1276120  W00-6727  2008-07-15    3
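As a possible alternative, here is a minimal sketch that numbers each distinct sub_id in order of first appearance within its tag_id, using pd.factorize inside a groupby transform. It assumes that rows sharing a sub_id within a tag_id should always share a number, which is slightly different from the approach above: dfx.duplicated() only flags rows where every column repeats. The column name Seq_alt is made up for illustration.

```python
import pandas as pd

dfx = pd.DataFrame({
    'tag_id': [496976, 496976, 603100, 603100, 715078, 715078,
               978924, 978924, 1276120, 1276120, 1276120],
    'sub_id': ['W00-8748', 'A00-1012', 'A00-4028', 'A00-4028', 'W00-0706',
               'W00-1960', 'W00-1066', 'W00-6462', 'W00-1397', 'W00-1427',
               'W00-6727'],
    'log_date': ['2012-06-28', '2013-06-28', '2016-02-18', '2016-02-18',
                 '2008-02-25', '2008-09-28', '2008-04-28', '2010-10-20',
                 '2008-07-10', '2008-07-15', '2008-07-15'],
})
dfx = dfx.sort_values(by=['tag_id', 'log_date']).reset_index(drop=True)

# pd.factorize returns (codes, uniques); codes start at 0 and follow the
# order of first appearance, so adding 1 gives a 1-based sequence.
# Seq_alt is a hypothetical column name, not part of the original answer.
dfx['Seq_alt'] = (
    dfx.groupby('tag_id')['sub_id']
       .transform(lambda s: pd.factorize(s)[0] + 1)
)
print(dfx)
```

On the sample data this gives the same numbering as the Seq column above. The two approaches would differ if a sub_id reappeared within the same tag_id with a different log_date.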