How to drop duplicates, keeping only the first occurrence, but only for one category in pandas

Question · Votes: 0 · Answers: 2

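How can I drop duplicates, keeping only the first occurrence, but only for one category in pandas?

The event_name column has two categories, process_now and fast_order, and the deduplication has some special rules:

1. Only duplicates in the fast_order category are removed.
2. If fast_order appears several times in a row, only one row is kept for each consecutive run (not one per User_id).
3. The row that is kept is the first one in the run.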

Data

User_id   event_name        timestamp
1         process_now       08:00:01
1         process_now       08:00:02
1         process_now       08:00:03
1         fast_order        08:00:04
1         fast_order        08:00:05
1         process_now       08:00:06
2         process_now       08:00:01
2         process_now       08:00:02
2         fast_order        08:00:03
2         fast_order        08:00:04
2         fast_order        08:00:05
2         process_now       08:00:06
2         fast_order        08:00:07
2         fast_order        08:00:08
2         process_now       08:00:09
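
To make the question reproducible, the table above can be loaded into a DataFrame. This is only a sketch: the variable names raw and df are my own, and the timestamps are left as plain strings because the question does not say how they are stored.

```python
import io

import pandas as pd

# The sample data from the question, pasted as whitespace-separated text
raw = """User_id event_name timestamp
1 process_now 08:00:01
1 process_now 08:00:02
1 process_now 08:00:03
1 fast_order 08:00:04
1 fast_order 08:00:05
1 process_now 08:00:06
2 process_now 08:00:01
2 process_now 08:00:02
2 fast_order 08:00:03
2 fast_order 08:00:04
2 fast_order 08:00:05
2 process_now 08:00:06
2 fast_order 08:00:07
2 fast_order 08:00:08
2 process_now 08:00:09
"""

# sep=r"\s+" splits on any run of whitespace; timestamps stay as strings
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.shape)  # (15, 3)
```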

The output I need is:

User_id   event_name        timestamp
1         process_now       08:00:01
1         process_now       08:00:02
1         process_now       08:00:03
1         fast_order        08:00:04
1         process_now       08:00:06
2         process_now       08:00:01
2         process_now       08:00:02
2         fast_order        08:00:03
2         process_now       08:00:06
2         fast_order        08:00:07
2         process_now       08:00:09

How can I do this?

python pandas dataframe duplicates
2 Answers
2 votes

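Use DataFrame.duplicated on two columns: User_id and a helper column that labels consecutive runs of event_name. Invert the result with ~ so that the first row of each run is True, then chain it with bitwise | (OR) to a condition that keeps every row whose event_name is not fast_order: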

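# Label consecutive runs of event_name: the counter increases whenever the value changes
g = df['event_name'].ne(df['event_name'].shift()).cumsum()
# Keep every row that is not fast_order, plus the first row of each (User_id, run) pair
df = df[df['event_name'].ne('fast_order') | ~df.assign(g=g).duplicated(['User_id','g'])]
print (df)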
    User_id   event_name timestamp
0         1  process_now  08:00:01
1         1  process_now  08:00:02
2         1  process_now  08:00:03
3         1   fast_order  08:00:04
5         1  process_now  08:00:06
6         2  process_now  08:00:01
7         2  process_now  08:00:02
8         2   fast_order  08:00:03
11        2  process_now  08:00:06
12        2   fast_order  08:00:07
14        2  process_now  08:00:09

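Details (these were printed after the filter above, so df is already the filtered frame, while g still holds the run labels computed on the original rows):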

print (df.assign(g=g))
    User_id   event_name timestamp  g
0         1  process_now  08:00:01  1
1         1  process_now  08:00:02  1
2         1  process_now  08:00:03  1
3         1   fast_order  08:00:04  2
5         1  process_now  08:00:06  3
6         2  process_now  08:00:01  3
7         2  process_now  08:00:02  3
8         2   fast_order  08:00:03  4
11        2  process_now  08:00:06  5
12        2   fast_order  08:00:07  6
14        2  process_now  08:00:09  7
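
The g column comes from the ne/shift/cumsum idiom. A minimal sketch on a toy Series (the values "a" and "b" are made up for illustration) shows how it numbers consecutive runs:

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "b", "a"])

# True wherever the value differs from the previous row;
# the first row is compared with NaN, so it is always True
starts = s.ne(s.shift())

# The cumulative sum of the run starts gives every run its own number
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 1, 2, 2, 3]
```

Note that the labels are global, not per user, which is why User_id is passed to duplicated alongside g.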

print (df.assign(g=g).duplicated(['User_id','g']))
0     False
1      True
2      True
3     False
5     False
6     False
7      True
8     False
11    False
12    False
14    False
dtype: bool

print (~df.assign(g=g).duplicated(['User_id','g']))
0      True
1     False
2     False
3      True
5      True
6      True
7     False
8      True
11     True
12     True
14     True
dtype: bool
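
The same result can probably be reached without the helper column by comparing each row with the previous row of the same user. This is a sketch of a different technique from the one above, and it assumes that each user's rows are contiguous and already sorted by time, as in the sample data. The names prev, is_fast and result are my own.

```python
import pandas as pd

# Sample data from the question, rebuilt compactly
df = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": (
        ["process_now"] * 3 + ["fast_order"] * 2 + ["process_now"]
        + ["process_now"] * 2 + ["fast_order"] * 3 + ["process_now"]
        + ["fast_order"] * 2 + ["process_now"]
    ),
    "timestamp": [f"08:00:0{i}" for i in range(1, 7)]
                 + [f"08:00:0{i}" for i in range(1, 10)],
})

# Previous event of the same user (NaN on each user's first row)
prev = df.groupby("User_id")["event_name"].shift()
is_fast = df["event_name"].eq("fast_order")

# Drop a row only if it is fast_order and the row before it
# (for the same user) was fast_order as well
result = df[~(is_fast & prev.eq("fast_order"))]
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 11, 12, 14]
```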

0 votes
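# Each non-fast_order row starts a new group; within a group keep the first row per event_name
df1.groupby(df1.event_name.ne("fast_order").cumsum(), as_index=False, group_keys=False).apply(
    lambda dd: dd.drop_duplicates(subset='event_name', keep='first'))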


    User_id   event_name timestamp
0         1  process_now  08:00:01
1         1  process_now  08:00:02
2         1  process_now  08:00:03
3         1   fast_order  08:00:04
5         1  process_now  08:00:06
6         2  process_now  08:00:01
7         2  process_now  08:00:02
8         2   fast_order  08:00:03
11        2  process_now  08:00:06
12        2   fast_order  08:00:07
14        2  process_now  08:00:09
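
The grouping above does not look at User_id, so a fast_order run that ends one user and another that begins the next user would fall into the same group. Below is a sketch of the same grouping idea without apply, with User_id added to the key. Treating runs of different users as separate is my assumption (it matches the other answer), and block, dup and result are hypothetical names.

```python
import pandas as pd

# Sample data from the question, rebuilt compactly
df1 = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": (
        ["process_now"] * 3 + ["fast_order"] * 2 + ["process_now"]
        + ["process_now"] * 2 + ["fast_order"] * 3 + ["process_now"]
        + ["fast_order"] * 2 + ["process_now"]
    ),
    "timestamp": [f"08:00:0{i}" for i in range(1, 7)]
                 + [f"08:00:0{i}" for i in range(1, 10)],
})

# Every non-fast_order row opens a new block; the fast_order rows
# that follow it share that block number
block = df1["event_name"].ne("fast_order").cumsum()

# A row is a duplicate if the same (User_id, block, event_name)
# combination has already appeared earlier in the frame
dup = df1.assign(block=block).duplicated(["User_id", "block", "event_name"])
result = df1[~dup]
print(result.index.tolist())  # [0, 1, 2, 3, 5, 6, 7, 8, 11, 12, 14]
```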