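How to remove duplicates by keeping only the first occurrence, but only for one category, in pandas
The event_name column has two categories, process_now and fast_order, but removing the duplicates comes with some special requirements:
1. Only remove duplicates in the fast_order category.
2. If fast_order appears several times in a row, keep only one row per consecutive run (not one per User_id).
3. When removing duplicates, keep the first occurrence.
Data: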
User_id event_name timestamp
1 process_now 08:00:01
1 process_now 08:00:02
1 process_now 08:00:03
1 fast_order 08:00:04
1 fast_order 08:00:05
1 process_now 08:00:06
2 process_now 08:00:01
2 process_now 08:00:02
2 fast_order 08:00:03
2 fast_order 08:00:04
2 fast_order 08:00:05
2 process_now 08:00:06
2 fast_order 08:00:07
2 fast_order 08:00:08
2 process_now 08:00:09
What I need to get is:
User_id event_name timestamp
1 process_now 08:00:01
1 process_now 08:00:02
1 process_now 08:00:03
1 fast_order 08:00:04
1 process_now 08:00:06
2 process_now 08:00:01
2 process_now 08:00:02
2 fast_order 08:00:03
2 process_now 08:00:06
2 fast_order 08:00:07
2 process_now 08:00:09
How can I do this?
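Use DataFrame.duplicated on the consecutive groups, invert the result, and chain it with bitwise OR (|) to a condition that tests whether event_name is not equal to fast_order: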
g = df['event_name'].ne(df['event_name'].shift()).cumsum()
df = df[df['event_name'].ne('fast_order') | ~df.assign(g=g).duplicated(['User_id','g'])]
print (df)
User_id event_name timestamp
0 1 process_now 08:00:01
1 1 process_now 08:00:02
2 1 process_now 08:00:03
3 1 fast_order 08:00:04
5 1 process_now 08:00:06
6 2 process_now 08:00:01
7 2 process_now 08:00:02
8 2 fast_order 08:00:03
11 2 process_now 08:00:06
12 2 fast_order 08:00:07
14 2 process_now 08:00:09
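For anyone who wants to run this end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt from the sample data in the question; the helper name drop_repeats_in_runs and its category parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": (
        ["process_now"] * 3 + ["fast_order"] * 2 + ["process_now"] * 3
        + ["fast_order"] * 3 + ["process_now"] + ["fast_order"] * 2 + ["process_now"]
    ),
    "timestamp": [f"08:00:{s:02d}" for s in range(1, 7)]
                 + [f"08:00:{s:02d}" for s in range(1, 10)],
})


def drop_repeats_in_runs(frame, category="fast_order"):
    """Hypothetical helper: keep only the first row of each consecutive run of `category`."""
    # A new run id starts every time event_name differs from the previous row.
    g = frame["event_name"].ne(frame["event_name"].shift()).cumsum()
    # Keep a row if it is not in `category`, or if it is the first row of its (User_id, run).
    keep = frame["event_name"].ne(category) | ~frame.assign(g=g).duplicated(["User_id", "g"])
    return frame[keep]


result = drop_repeats_in_runs(df)
print(result)
```

Returning a new frame instead of reassigning df keeps the original data available for inspecting the intermediate values.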
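Details (printed after df was reassigned to the filtered result, so the dropped rows 4, 9, 10 and 13 no longer appear):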
print (df.assign(g=g))
User_id event_name timestamp g
0 1 process_now 08:00:01 1
1 1 process_now 08:00:02 1
2 1 process_now 08:00:03 1
3 1 fast_order 08:00:04 2
5 1 process_now 08:00:06 3
6 2 process_now 08:00:01 3
7 2 process_now 08:00:02 3
8 2 fast_order 08:00:03 4
11 2 process_now 08:00:06 5
12 2 fast_order 08:00:07 6
14 2 process_now 08:00:09 7
print (df.assign(g=g).duplicated(['User_id','g']))
0 False
1 True
2 True
3 False
5 False
6 False
7 True
8 False
11 False
12 False
14 False
dtype: bool
print (~df.assign(g=g).duplicated(['User_id','g']))
0 True
1 False
2 False
3 True
5 True
6 True
7 False
8 True
11 True
12 True
14 True
dtype: bool
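To see which rows the mask actually flags, it may help to compute the same intermediate values on the unfiltered frame. The sketch below rebuilds the sample data from the question and keeps it in a separate variable; the names raw, dup and dropped are only illustrative.

```python
import pandas as pd

# Unfiltered sample data from the question.
raw = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": (
        ["process_now"] * 3 + ["fast_order"] * 2 + ["process_now"] * 3
        + ["fast_order"] * 3 + ["process_now"] + ["fast_order"] * 2 + ["process_now"]
    ),
})

# Run id: increases by one whenever event_name changes from the previous row.
g = raw["event_name"].ne(raw["event_name"].shift()).cumsum()

# True for every row that is not the first of its (User_id, run) pair.
dup = raw.assign(g=g).duplicated(["User_id", "g"])

# Rows that the final filter removes: duplicated AND in the fast_order category.
dropped = raw[raw["event_name"].eq("fast_order") & dup]

print(raw.assign(g=g, dup=dup))
print(dropped.index.tolist())
```

Note that repeated process_now rows are also marked as duplicates here; they survive only because the ne('fast_order') half of the OR keeps them.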
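Another option is to group on the cumulative count of rows that are not fast_order, so that each process_now row starts a new group and any fast_order rows that follow it fall into the same group, and then drop duplicated event_name values within each group (the original frame is called df1 here):
df1.groupby(df1.event_name.ne("fast_order").cumsum(), as_index=False, group_keys=False).apply(lambda dd: dd.drop_duplicates(subset='event_name', keep='first'))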
User_id event_name timestamp
0 1 process_now 08:00:01
1 1 process_now 08:00:02
2 1 process_now 08:00:03
3 1 fast_order 08:00:04
5 1 process_now 08:00:06
6 2 process_now 08:00:01
7 2 process_now 08:00:02
8 2 fast_order 08:00:03
11 2 process_now 08:00:06
12 2 fast_order 08:00:07
14 2 process_now 08:00:09
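A third possibility, not taken from either answer, is to compare each row with the previous one using shift and avoid groupby altogether. This is only a sketch on the question's sample data; the variable names are illustrative. One difference worth noting: the groupby version above keys only on event_name, so a fast_order run that spans two users would be treated as a single run, whereas this mask also requires the previous row to belong to the same User_id.

```python
import pandas as pd

# Sample data from the question (timestamp omitted, it is not needed for the mask).
df = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": (
        ["process_now"] * 3 + ["fast_order"] * 2 + ["process_now"] * 3
        + ["fast_order"] * 3 + ["process_now"] + ["fast_order"] * 2 + ["process_now"]
    ),
})

is_fast = df["event_name"].eq("fast_order")

# A row is a repeat if it is fast_order, the previous row was fast_order too,
# and the previous row belongs to the same user.
repeat = (
    is_fast
    & is_fast.shift(fill_value=False)
    & df["User_id"].eq(df["User_id"].shift())
)

result = df[~repeat]
print(result)
```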