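I have a tabular log file with the (simplified) structure `<time>, <event_tag>` and want to find the intervals between two different event tags, `problem` and `all fine`. The difficulty is that both `all fine` and `problem` are often repeated.
A manual algorithm would therefore find the first `problem`, look for the next `all fine`, and continue in this way until the last `problem` and its subsequent `all fine`.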
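The manual algorithm described above can be sketched as a plain loop. This is only an illustration: `find_problem_intervals` is a hypothetical helper name, the input is a made-up toy series rather than the log from the question, and a `problem` that is never followed by an `all fine` is assumed to be dropped.

```r
# Hypothetical helper: pair the first "problem" of each run with the next "all fine".
# `time` and `event_tag` are parallel vectors, assumed sorted by time.
find_problem_intervals <- function(time, event_tag) {
  starts <- integer(0)
  ends <- integer(0)
  open_start <- NA_integer_              # index of the currently open problem
  for (i in seq_along(event_tag)) {
    if (event_tag[i] == "problem" && is.na(open_start)) {
      open_start <- i                    # first "problem" of a run opens an interval
    } else if (event_tag[i] == "all fine" && !is.na(open_start)) {
      starts <- c(starts, open_start)    # the next "all fine" closes it
      ends <- c(ends, i)
      open_start <- NA_integer_
    }
  }
  data.frame(problem_start = time[starts], problem_end = time[ends])
}

# Toy input (made up for illustration): problems at positions 2, 3 and 6
tags <- c("all fine", "problem", "problem", "all fine",
          "all fine", "problem", "all fine")
times <- as.POSIXct("2024-01-01", tz = "UTC") + 12 * 3600 * (seq_along(tags) - 1)
intervals <- find_problem_intervals(times, tags)
```

On the toy input this pairs position 2 with position 4 and position 6 with position 7. The loop is easy to reason about but grows vectors element by element, so a vectorised approach is likely preferable for large logs.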
Sample dataset:
library(data.table)
set.seed(156125)
DT <- data.table(
  time = seq(as.POSIXct(tz = "UTC", "2024-01-01"),
             as.POSIXct(tz = "UTC", "2024-01-10"),
             by = "12 hours"),
  event_tag = c("problem", "all fine")[round(runif(19, 1.2, 2.49))]
)
# time event_tag
# <POSc> <char>
# 1: 2024-01-01 00:00:00 all fine
# 2: 2024-01-01 12:00:00 all fine
# 3: 2024-01-02 00:00:00 problem
# 4: 2024-01-02 12:00:00 all fine
# 5: 2024-01-03 00:00:00 all fine
# 6: 2024-01-03 12:00:00 problem
# 7: 2024-01-04 00:00:00 problem
# 8: 2024-01-04 12:00:00 all fine
# 9: 2024-01-05 00:00:00 all fine
# 10: 2024-01-05 12:00:00 all fine
# 11: 2024-01-06 00:00:00 all fine
# 12: 2024-01-06 12:00:00 all fine
# 13: 2024-01-07 00:00:00 problem
# 14: 2024-01-07 12:00:00 all fine
# 15: 2024-01-08 00:00:00 all fine
# 16: 2024-01-08 12:00:00 problem
# 17: 2024-01-09 00:00:00 all fine
# 18: 2024-01-09 12:00:00 all fine
# 19: 2024-01-10 00:00:00 all fine
Desired result:
data.table(problem_start = DT$time[c(3, 6, 13, 16)],
           problem_end = DT$time[c(4, 8, 14, 17)])
# problem_start problem_end
# <POSc> <POSc>
# 1: 2024-01-02 00:00:00 2024-01-02 12:00:00
# 2: 2024-01-03 12:00:00 2024-01-04 12:00:00
# 3: 2024-01-07 00:00:00 2024-01-07 12:00:00
# 4: 2024-01-08 12:00:00 2024-01-09 00:00:00
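I thought about some solutions that turn the two tags into a boolean and use `cumsum`, but couldn't quite figure it out. Maybe there is a neat `data.table` way to do this that I am not seeing at the moment. That said, even though I prefer `dplyr`, I would be happy with a `data.table` solution.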
DT[ , bool := ifelse(event_tag == "all fine", 0, 1)]
DT[ , cumsum(bool)]
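One way the boolean-plus-`cumsum` idea might be completed is to flag only the first `problem` of each run and take the cumulative sum of that flag as a run id. The sketch below is an assumption about what was intended, not a confirmed solution; the column names `is_problem`, `run_start` and `run_id` are hypothetical, and the data is a made-up toy series.

```r
library(data.table)

# Toy data (made up for illustration): problems at rows 2, 3 and 6
toy <- data.table(
  time = as.POSIXct("2024-01-01", tz = "UTC") + 12 * 3600 * (0:6),
  event_tag = c("all fine", "problem", "problem", "all fine",
                "all fine", "problem", "all fine")
)

toy[, is_problem := event_tag == "problem"]
# A run starts where a "problem" row follows a non-"problem" row (or is row 1)
toy[, run_start := is_problem & !shift(is_problem, fill = FALSE)]
# cumsum() over the run starts gives every row the id of the latest run
toy[, run_id := cumsum(run_start)]

# Per run: the first row is the start, the first "all fine" row is the end
res <- toy[run_id > 0,
           .(problem_start = time[1], problem_end = time[!is_problem][1]),
           by = run_id][, run_id := NULL][]
```

Rows before the first `problem` get `run_id` 0 and are filtered out. A run with no following `all fine` would get an `NA` end time rather than being dropped.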
You can do a rolling join of the end times onto the start times, then take the earliest start time for each end time:
# Start times: every "problem" row; end times: every "all fine" row
starts <- DT[event_tag == "problem", .(problem_start = time)]
ends <- DT[event_tag == "all fine", .(time, problem_end = time)]
# roll = -Inf matches each start to the next "all fine" at or after it
result <- ends[starts, on = .(time == problem_start), roll = -Inf][
  , .(problem_start = first(time)), by = problem_end]
setorder(result, problem_start)
setcolorder(result, c("problem_start", "problem_end"))
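To see what the two steps of the rolling join do, it may help to run them separately on a small input. The sketch below uses a made-up toy series instead of `DT`, and the intermediate names `joined` and `collapsed` are introduced here for illustration only.

```r
library(data.table)

# Toy data (made up for illustration): problems at rows 2, 3 and 6
toy <- data.table(
  time = as.POSIXct("2024-01-01", tz = "UTC") + 12 * 3600 * (0:6),
  event_tag = c("all fine", "problem", "problem", "all fine",
                "all fine", "problem", "all fine")
)
starts <- toy[event_tag == "problem", .(problem_start = time)]
ends <- toy[event_tag == "all fine", .(time, problem_end = time)]

# One row per "problem"; rows 2 and 3 both roll to the same end at row 4
joined <- ends[starts, on = .(time == problem_start), roll = -Inf]
# Collapse problems that share an end, keeping the earliest start
collapsed <- joined[, .(problem_start = first(time)), by = problem_end]
```

In `joined`, the `time` column carries the start times from `starts`, which is why the second step can take `first(time)` per `problem_end`.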