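I have a dataset with group_ids. I want to attach a unique id to each group for as long as that group is continuously present in my dataset, meaning it disappears for at most 5 seconds. If a group disappears for more than 5 seconds, it should get a new cumulative ID. If it stays present, or disappears for no more than 5 seconds, it should keep the same cumulative number.
Here is my dataset: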
group_id dt2 unique_id
0 nan 2023-11-28 17:43:09.900628 1
1 1 2023-11-28 17:43:11.322793 2
2 1 2023-11-28 17:43:12.660818 2
3 1 2023-11-28 17:43:14.119043 2
4 1 2023-11-28 17:43:15.550513 2
5 2 2023-11-28 17:43:15.550513 3
6 3 2023-11-28 17:43:15.550513 4
7 4 2023-11-28 17:43:15.550513 5
8 1 2023-11-28 17:43:16.973557 6
9 2 2023-11-28 17:43:16.973557 7
10 3 2023-11-28 17:43:16.973557 8
11 4 2023-11-28 17:43:16.973557 9
12 1 2023-11-28 17:43:18.335619 10
13 2 2023-11-28 17:43:18.335619 11
14 3 2023-11-28 17:43:18.335619 12
15 4 2023-11-28 17:43:18.335619 13
16 1 2023-11-28 17:43:19.738230 14
17 2 2023-11-28 17:43:19.738230 15
18 3 2023-11-28 17:43:19.738230 16
19 4 2023-11-28 17:43:19.738230 17
20 1 2023-11-28 17:43:21.110693 18
21 2 2023-11-28 17:43:21.110693 19
22 1 2023-11-28 17:43:22.571257 20
23 2 2023-11-28 17:43:22.571257 21
24 1 2023-11-28 17:43:24.000589 22
25 1 2023-11-28 17:43:25.429940 22
26 2 2023-11-28 17:43:25.429940 23
27 1 2023-11-28 17:43:26.851142 24
28 2 2023-11-28 17:43:26.851142 25
29 1 2023-11-28 17:43:28.256274 26
30 nan 2023-11-28 17:43:29.617541 27
31 nan 2023-11-28 17:43:30.974490 27
32 nan 2023-11-28 17:43:32.360739 27
33 1 2023-11-28 17:43:33.730457 28
34 1 2023-11-28 17:43:35.270380 28
I tried this cumsum() approach to create my "unique_id" column:
df['unique_id'] = (df['group_id'].eq(0) | (df['group_id'] != df['group_id'].shift())).cumsum()
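This is a step in the right direction, but it attaches a new cumulative value to my groups even when, according to the datetime column, they are still present. Is there a way to implement logic like this: only assign a new cumulative value as unique_id if the group_id has been absent for at least 5 seconds?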
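import numpy as np
import pandas as pd

# Parse the timestamp column
df['dt2'] = pd.to_datetime(df['dt2'], errors='raise')
# Shift rows 12 and 20 forward by 10 seconds to simulate gaps for this demo
df.loc[[12, 20], 'dt2'] += pd.to_timedelta(10, unit='s')
# Seconds elapsed since the previous row
df['delta'] = df['dt2'].diff() / np.timedelta64(1, 's')
# Increment the counter whenever the gap to the previous row is 5 seconds or more
df['test'] = (df['delta'] >= 5).cumsum()
print(df)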
Output:
group_id dt2 unique_id delta test
0 nan 2023-11-28 17:43:09.900628 1 NaN 0
1 1 2023-11-28 17:43:11.322793 2 1.422165 0
2 1 2023-11-28 17:43:12.660818 2 1.338025 0
3 1 2023-11-28 17:43:14.119043 2 1.458225 0
4 1 2023-11-28 17:43:15.550513 2 1.431470 0
5 2 2023-11-28 17:43:15.550513 3 0.000000 0
6 3 2023-11-28 17:43:15.550513 4 0.000000 0
7 4 2023-11-28 17:43:15.550513 5 0.000000 0
8 1 2023-11-28 17:43:16.973557 6 1.423044 0
9 2 2023-11-28 17:43:16.973557 7 0.000000 0
10 3 2023-11-28 17:43:16.973557 8 0.000000 0
11 4 2023-11-28 17:43:16.973557 9 0.000000 0
12 1 2023-11-28 17:43:28.335619 10 11.362062 1
13 2 2023-11-28 17:43:18.335619 11 -10.000000 1
14 3 2023-11-28 17:43:18.335619 12 0.000000 1
15 4 2023-11-28 17:43:18.335619 13 0.000000 1
16 1 2023-11-28 17:43:19.738230 14 1.402611 1
17 2 2023-11-28 17:43:19.738230 15 0.000000 1
18 3 2023-11-28 17:43:19.738230 16 0.000000 1
19 4 2023-11-28 17:43:19.738230 17 0.000000 1
20 1 2023-11-28 17:43:31.110693 18 11.372463 2
21 2 2023-11-28 17:43:21.110693 19 -10.000000 2
22 1 2023-11-28 17:43:22.571257 20 1.460564 2
23 2 2023-11-28 17:43:22.571257 21 0.000000 2
24 1 2023-11-28 17:43:24.000589 22 1.429332 2
25 1 2023-11-28 17:43:25.429940 22 1.429351 2
26 2 2023-11-28 17:43:25.429940 23 0.000000 2
27 1 2023-11-28 17:43:26.851142 24 1.421202 2
28 2 2023-11-28 17:43:26.851142 25 0.000000 2
29 1 2023-11-28 17:43:28.256274 26 1.405132 2
30 nan 2023-11-28 17:43:29.617541 27 1.361267 2
31 nan 2023-11-28 17:43:30.974490 27 1.356949 2
32 nan 2023-11-28 17:43:32.360739 27 1.386249 2
33 1 2023-11-28 17:43:33.730457 28 1.369718 2
34 1 2023-11-28 17:43:35.270380 28 1.539923 2
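Note that the snippet above measures the gap between consecutive rows, whatever group they belong to. If the gap should instead be measured per group, as the question describes, one possible variant is sketched below. It is only a sketch under some assumptions: the function name `assign_unique_ids`, the `max_gap_s` parameter, and the `-1` sentinel are made up for illustration, rows with a NaN group_id are treated as one group of their own, the DataFrame index is unique, and "more than 5 seconds" is read as a strict comparison.

```python
import numpy as np
import pandas as pd


def assign_unique_ids(df, max_gap_s=5.0):
    """Give a group a new ID on first sight or after an absence longer than max_gap_s."""
    out = df.copy()
    out["dt2"] = pd.to_datetime(out["dt2"])
    # Work in time order; mergesort is stable, so ties keep their original order.
    ordered = out.sort_values("dt2", kind="mergesort")
    # Assumption: NaN group_ids form one group; -1 must not be a real group_id.
    key = ordered["group_id"].fillna(-1)
    # Time since the same group was last seen (NaT on a group's first row).
    gap = ordered["dt2"].groupby(key).diff()
    starts_new = gap.isna() | (gap > pd.Timedelta(seconds=max_gap_s))
    # Number only the rows that start a new run, then carry that number
    # forward within each group so interleaved groups do not disturb it.
    new_ids = starts_new.cumsum().where(starts_new)
    out["unique_id"] = new_ids.groupby(key).ffill().astype(int)
    return out


# Small made-up frame: group 2 is absent for 7 s, group 1 for exactly 5 s.
start = pd.Timestamp("2023-11-28 17:43:00")
demo = pd.DataFrame({
    "group_id": [np.nan, 1, 1, 2, 1, 2, 1, np.nan],
    "dt2": [start + pd.Timedelta(seconds=s) for s in (0, 1, 2, 2, 4, 9, 9, 10)],
})
result = assign_unique_ids(demo)
print(result["unique_id"].tolist())  # [1, 2, 2, 3, 2, 4, 2, 5]
```

Here group 1 keeps ID 2 throughout because it is never gone for more than 5 seconds, while group 2 and the NaN rows each get a fresh ID after their longer absences. Changing `>` to `>=` would make an absence of exactly 5 seconds start a new ID as well.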