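I want to assign a new ID to the current group (the id column) based on a per-group delta, which indicates how long that group (id) has been absent from the data frame. If the delta is over 5 (seconds), I want to give the rows a new ID, as shown in the "expected_id" column. The new IDs should be unique.
Any suggestions on how to do this?
Here is my data (df):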
id datetime delta expected_id
0 2023-12-04 10:51:30.158743 nan 0
1 2023-12-04 10:51:31.734037 nan 1
1 2023-12-04 10:51:33.219067 1.48 1
1 2023-12-04 10:51:34.469723 1.25 1
0 2023-12-04 10:51:35.862997 5.70 2
0 2023-12-04 10:51:37.280209 1.41 2
0 2023-12-04 10:51:38.741301 1.46 2
0 2023-12-04 10:51:40.239296 1.49 2
1 2023-12-04 10:51:41.590683 7.12 3
1 2023-12-04 10:51:43.060751 1.47 3
1 2023-12-04 10:51:44.566724 1.50 3
1 2023-12-04 10:51:46.066713 1.49 3
0 2023-12-04 10:51:47.493897 7.25 4
0 2023-12-04 10:51:48.994885 1.50 4
0 2023-12-04 10:51:50.557707 1.56 4
0 2023-12-04 10:51:52.116537 1.55 4
0 2023-12-04 10:51:53.642456 1.52 4
1 2023-12-04 10:51:55.115518 9.04 5
I computed the delta column with:
df['delta'] = df.groupby("id")['datetime'].diff() / np.timedelta64(1, 's')
How can I create my "expected_id" column?
If I understand correctly, you just need to convert id into the new expected id, like this:
df["new_expected_id"] = (df["id"] != df["id"].shift()).cumsum() - 1
print(df)
Prints:
id datetime delta expected_id new_expected_id
0 0 2023-12-04 10:51:30.158743 NaN 0 0
1 1 2023-12-04 10:51:31.734037 NaN 1 1
2 1 2023-12-04 10:51:33.219067 1.48 1 1
3 1 2023-12-04 10:51:34.469723 1.25 1 1
4 0 2023-12-04 10:51:35.862997 5.70 2 2
5 0 2023-12-04 10:51:37.280209 1.41 2 2
6 0 2023-12-04 10:51:38.741301 1.46 2 2
7 0 2023-12-04 10:51:40.239296 1.49 2 2
8 1 2023-12-04 10:51:41.590683 7.12 3 3
9 1 2023-12-04 10:51:43.060751 1.47 3 3
10 1 2023-12-04 10:51:44.566724 1.50 3 3
11 1 2023-12-04 10:51:46.066713 1.49 3 3
12 0 2023-12-04 10:51:47.493897 7.25 4 4
13 0 2023-12-04 10:51:48.994885 1.50 4 4
14 0 2023-12-04 10:51:50.557707 1.56 4 4
15 0 2023-12-04 10:51:52.116537 1.55 4 4
16 0 2023-12-04 10:51:53.642456 1.52 4 4
17 1 2023-12-04 10:51:55.115518 9.04 5 5
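To see why the shift/cumsum line produces these labels, here is a minimal sketch on a small made-up Series (the values are illustrative, not taken from the question). Comparing each value with the previous one marks the start of every run, and the cumulative sum of those marks numbers the runs:

```python
import pandas as pd

# Hypothetical id sequence, only for illustration.
s = pd.Series([0, 1, 1, 1, 0, 0, 1])

# True wherever the value differs from the previous row.
# The first row compares against NaN, so it is always True.
changed = s != s.shift()

# Each True bumps the counter by one; subtract 1 to start at 0.
run_id = changed.cumsum() - 1

print(run_id.tolist())  # [0, 1, 1, 1, 2, 2, 3]
```

Note that this numbers consecutive runs only; it does not look at delta at all.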
Edit:
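import pandas as pd

# Number each consecutive run of identical ids (1, 2, 3, ...).
group_mask = (df["id"] != df["id"].shift()).cumsum()
# Flag rows where the same id reappears after more than 5 seconds.
df["new_expected_id"] = df["delta"] > 5.0
# Within each run, build a "<run number> <gap count>" string key.
df["new_expected_id"] = (
    df.groupby(["id", group_mask], sort=False)["new_expected_id"]
    .apply(lambda x: f"{x.name[1]} " + x.cumsum().astype(str))
    .values
)
# Convert the string keys to unique integer codes
# (codes follow the sorted order of the string keys).
df["new_expected_id"] = pd.Categorical(df["new_expected_id"]).codes
print(df)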
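A possible alternative, sketched here under the assumption that a new ID should start either when the id changes from the previous row or when delta exceeds 5 seconds: combine both conditions into one boolean mask and take a single cumulative sum. The sample frame below is made up for illustration, and `starts` is just a name chosen for this sketch:

```python
import pandas as pd

# Small made-up frame for illustration; not the data from the question.
df = pd.DataFrame({
    "id": [0, 1, 1, 0, 0, 0, 1],
    "datetime": pd.to_datetime([
        "2023-12-04 10:51:30", "2023-12-04 10:51:31", "2023-12-04 10:51:33",
        "2023-12-04 10:51:36", "2023-12-04 10:51:37", "2023-12-04 10:51:45",
        "2023-12-04 10:51:46",
    ]),
})

# Seconds since the same id was last seen (NaN for each id's first row).
df["delta"] = df.groupby("id")["datetime"].diff().dt.total_seconds()

# A new segment starts when the id differs from the previous row,
# or when the same id continues after a gap of more than 5 seconds.
starts = df["id"].ne(df["id"].shift()) | df["delta"].gt(5)

# Number the segments in order of appearance, starting at 0.
df["new_expected_id"] = starts.cumsum() - 1

print(df["new_expected_id"].tolist())  # [0, 1, 1, 2, 2, 3, 4]
```

Because the IDs come straight from an integer cumulative sum, they stay in order of appearance and no string keys or categorical conversion are needed.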