How to group events that occur close together in time

Question  Votes: 0  Answers: 1

I want to group rows of data for events that occur close together in time by assigning a "group ID".

For example, suppose you have an event log from a computer and want to group together events that happened close to one another.

tibble::tribble(
  ~event_id,          ~happened_at,
      "xyz", "2023-07-31 13:35:06",
      "tsv", "2023-07-31 13:35:07",
      "abc", "2023-07-31 13:41:30",
      "fgh", "2023-07-31 13:42:05",
      "dda", "2023-07-31 13:42:12",
      "ggf",  "2023-08-01 4:43:15",
      "oor", "2023-08-01 13:49:36",
      "wqe", "2023-08-01 14:33:10",
      "oop", "2023-08-01 14:34:14"
  )

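I want to add another column that indicates the event group, so that if 2 or more events occur within a few seconds of each other, they get the same "group ID". Otherwise, a single event gets its own group ID.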

# desired output
tibble::tribble(
  ~event_id,          ~happened_at, ~group_id,
      "xyz", "2023-07-31 13:35:06",        1,
      "tsv", "2023-07-31 13:35:07",        1,
      "abc", "2023-07-31 13:41:30",        2,
      "fgh", "2023-07-31 13:42:05",        2,
      "dda", "2023-07-31 13:42:12",        2,
      "ggf",  "2023-08-01 4:43:15",        3,
      "oor", "2023-08-01 13:49:36",        4,
      "wqe", "2023-08-01 14:33:10",        5,
      "oop", "2023-08-01 14:34:14",        5
  )

Although this looks like a basic problem, I can't think of a way to do it. Are there any ideas or "best practices" for this kind of thing?

r dataframe dplyr time-series lubridate
1 Answer
0 votes

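You first need to define what "close" means. For demonstration purposes I define it as 2 minutes, but you can choose a value that suits your dataset.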

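library(dplyr)
library(lubridate)

# Events more than this many seconds apart start a new group
cutoff_secs <- 120L

# df is the tibble from the question
df %>%
  mutate(happened_at = ymd_hms(happened_at)) %>%
  arrange(happened_at) %>%
  # The first row is compared with itself (gap 0), so it never starts a new group
  mutate(group_id = cumsum(as.integer(difftime(happened_at,
                           lag(happened_at, default = first(happened_at)),
                           units = "secs")) > cutoff_secs) + 1)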

#  event_id happened_at         group_id
#  <chr>    <dttm>                 <dbl>
#1 xyz      2023-07-31 13:35:06        1
#2 tsv      2023-07-31 13:35:07        1
#3 abc      2023-07-31 13:41:30        2
#4 fgh      2023-07-31 13:42:05        2
#5 dda      2023-07-31 13:42:12        2
#6 ggf      2023-08-01 04:43:15        3
#7 oor      2023-08-01 13:49:36        4
#8 wqe      2023-08-01 14:33:10        5
#9 oop      2023-08-01 14:34:14        5 

With difftime, we compute the difference between consecutive values of happened_at, and whenever that difference exceeds cutoff_secs, the group_id count is incremented.
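
To make the three steps (gap, threshold, running count) visible, here is a minimal base R sketch that computes the same grouping on a plain vector of timestamps. The intermediate names gap_secs and starts_new_group are illustrative and not part of the answer above, and parsing in UTC is an assumption made so the result does not depend on the local time zone.

```r
# Timestamps from the question, parsed with base R (UTC is an assumption)
happened_at <- as.POSIXct(
  c("2023-07-31 13:35:06", "2023-07-31 13:35:07", "2023-07-31 13:41:30",
    "2023-07-31 13:42:05", "2023-07-31 13:42:12", "2023-08-01 04:43:15",
    "2023-08-01 13:49:36", "2023-08-01 14:33:10", "2023-08-01 14:34:14"),
  tz = "UTC"
)
cutoff_secs <- 120

# Step 1: gap in seconds to the previous event.
# The first event has no predecessor, so its gap is set to 0.
gap_secs <- c(0, as.numeric(diff(happened_at), units = "secs"))

# Step 2: TRUE wherever the gap exceeds the cutoff, i.e. where a new group starts
starts_new_group <- gap_secs > cutoff_secs

# Step 3: running count of group starts, offset by 1 so numbering begins at 1
group_id <- cumsum(starts_new_group) + 1

print(data.frame(happened_at, gap_secs, starts_new_group, group_id))
```

Note that this approach chains events: each event is compared only with the one immediately before it, so a long run of events spaced just under the cutoff ends up in a single group, even if the first and last are far apart. The timestamps must also be sorted first, which is why the dplyr version calls arrange().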
