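I want to group rows of data for events that happen close together in time by assigning them a "group ID".
For example, consider an event log from a computer, where you want to group together events that occurred close to each other.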
df <- tibble::tribble(
  ~event_id, ~happened_at,
  "xyz", "2023-07-31 13:35:06",
  "tsv", "2023-07-31 13:35:07",
  "abc", "2023-07-31 13:41:30",
  "fgh", "2023-07-31 13:42:05",
  "dda", "2023-07-31 13:42:12",
  "ggf", "2023-08-01 4:43:15",
  "oor", "2023-08-01 13:49:36",
  "wqe", "2023-08-01 14:33:10",
  "oop", "2023-08-01 14:34:14"
)
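I want to add another column that indicates the event group, so that if 2 or more events happen within a few seconds of each other, they get the same "group ID". Otherwise, a single event gets its own group ID.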
# desired output
tibble::tribble(
~event_id, ~happened_at, ~group_id,
"xyz", "2023-07-31 13:35:06", 1,
"tsv", "2023-07-31 13:35:07", 1,
"abc", "2023-07-31 13:41:30", 2,
"fgh", "2023-07-31 13:42:05", 2,
"dda", "2023-07-31 13:42:12", 2,
"ggf", "2023-08-01 4:43:15", 3,
"oor", "2023-08-01 13:49:36", 4,
"wqe", "2023-08-01 14:33:10", 5,
"oop", "2023-08-01 14:34:14", 5
)
Although this seems like a basic problem, I can't figure out a way to do it. Are there any ideas or "best practices" for something like this?
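You first need to define what "close" means. For demonstration purposes I define it as 2 minutes, but you can choose a value that suits your dataset.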
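library(dplyr)
library(lubridate)

# Largest gap, in seconds, allowed between consecutive events of one group
cutoff_secs <- 120L

# df is the tibble from the question
df %>%
  mutate(happened_at = ymd_hms(happened_at)) %>%
  arrange(happened_at) %>%
  # A gap above the cutoff starts a new group; cumsum() numbers the groups
  mutate(group_id = cumsum(as.integer(difftime(happened_at,
                                               lag(happened_at, default = first(happened_at)),
                                               units = "secs")) > cutoff_secs) + 1)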
# event_id happened_at group_id
# <chr> <dttm> <dbl>
#1 xyz 2023-07-31 13:35:06 1
#2 tsv 2023-07-31 13:35:07 1
#3 abc 2023-07-31 13:41:30 2
#4 fgh 2023-07-31 13:42:05 2
#5 dda 2023-07-31 13:42:12 2
#6 ggf 2023-08-01 04:43:15 3
#7 oor 2023-08-01 13:49:36 4
#8 wqe 2023-08-01 14:33:10 5
#9 oop 2023-08-01 14:34:14 5
With difftime we compute the difference between consecutive values of happened_at, and whenever that difference exceeds cutoff_secs we increment the group_id count.
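To make the intermediate steps visible, here is a minimal base R sketch of the same idea, with each step stored in its own vector. The names gap_secs and starts_new_group are only illustrative and do not appear in the code above, and the timestamps are parsed as UTC purely as an assumption for the example.

```r
# Timestamps from the question, parsed as UTC (an assumption for this sketch).
happened_at <- as.POSIXct(
  c("2023-07-31 13:35:06", "2023-07-31 13:35:07", "2023-07-31 13:41:30",
    "2023-07-31 13:42:05", "2023-07-31 13:42:12", "2023-08-01 04:43:15",
    "2023-08-01 13:49:36", "2023-08-01 14:33:10", "2023-08-01 14:34:14"),
  tz = "UTC"
)
cutoff_secs <- 120

# Seconds between each event and the previous one.
# The first event has no predecessor, so its gap is set to 0.
gap_secs <- c(0, diff(as.numeric(happened_at)))

# TRUE wherever the gap is large enough to start a new group.
starts_new_group <- gap_secs > cutoff_secs

# Running count of group starts, shifted so the first group is 1.
group_id <- cumsum(starts_new_group) + 1

data.frame(happened_at, gap_secs, starts_new_group, group_id)
```

Note that the comparison is always between consecutive events, so a long chain of closely spaced events ends up in one group even if its first and last events are more than cutoff_secs apart. If you need a hard limit on the total span of a group, you would need a different rule.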