我有一个关于故障时间的数据集。故障的开始点是
Begin
列,结束点是 End
列。> df_time
# A tibble: 8 × 3
Category Begin End
<chr> <dttm> <dttm>
1 A 2023-07-15 01:40:11 2023-07-15 13:43:15
2 A 2023-07-16 05:54:44 2023-07-16 10:50:45
3 B 2023-08-16 07:43:09 2023-08-16 16:42:12
4 C 2023-08-16 12:00:00 2023-08-16 13:11:13
5 A 2023-08-16 18:00:00 2023-08-16 19:30:00
6 A 2023-08-17 08:00:00 2023-08-17 13:00:00
7 C 2023-08-17 11:12:45 2023-08-17 19:58:22
8 A 2023-08-17 19:01:45 2023-08-17 23:59:59
现在我想计算每个日历日的摘要,显示有多少分钟(或小时或秒)没有发生任何故障。
这是每天所有绿灯时间间隔的总和:
如果只有一种故障类别,只需计算 24 小时减去故障间隔即可。
然而,我没有进一步处理不同的重叠故障间隔。
谁能帮我计算无故障周期和/或每日故障周期。
我想象这样的评估有点:(结果可能与上面的示例数据集不完全相同)
Date OK_times malfunction_times
<date> <chr> <chr>
1 2023-07-15 17 hours 7 hours
2 2023-07-16 5 Hours 19 hours
3 2023-07-17 3 hours 21 hours
这是创建示例数据框的代码:
library(tidyverse)
df_time <- tibble(
Category = c(
"A",
"A",
"B",
"C",
"A",
"A",
"C",
"A"
),
Begin = as_datetime(c(
"2023-07-15 01:40:11",
"2023-07-16 05:54:44",
"2023-08-16 07:43:09",
"2023-08-16 12:00:00",
"2023-08-16 18:00:00",
"2023-08-17 08:00:00",
"2023-08-17 11:12:45",
"2023-08-17 19:01:45"
)),
End = as_datetime(c(
"2023-07-15 13:43:15",
"2023-07-16 10:50:45",
"2023-08-16 16:42:12",
"2023-08-16 13:11:13",
"2023-08-16 19:30:00",
"2023-08-17 13:00:00",
"2023-08-17 19:58:22",
"2023-08-17 23:59:59"
))
)
interval_intersects <- function(i1, i2) {
# check if there's an overlap between the two intervals
between(i1$Begin, i2$Begin, i2$End) | between(i1$End, i2$Begin, i2$End) | between(i2$Begin, i1$Begin, i1$End) | between(i2$End, i1$Begin, i1$End)
}
tidy_intervals <- function(df) {
out <- df[0,] # empty tibble with the columns of the input df
while (nrow(df) > 0) {
matched = FALSE
if (nrow(out) != 0) { # this is so janky but R will try to loop through the dataframe, even when it has no rows, even using seq_along
# for each row in out, check if it intersects with the current row
for (j in 1:nrow(out)) {
# if it does, update the current output row to be the minimum of the two begin times and the maximum of the two end times
if (interval_intersects(df[1, ], out[j, ])) {
matched = TRUE
out[j, ] <- tibble(
Begin = min(df[1,]$Begin, out[j,]$Begin),
End = max(df[1,]$End, out[j,]$End)
)
break
}
}
}
# if the current row didn't intersect with any of the output rows, append it to the output
if (!matched) {
out <- out |> add_row(
Begin = df[1,]$Begin,
End = df[1,]$End)
}
# remove the current row from the input
df <- df[-1, ]
}
return(out)
}
tidy_intervals(df_time |> select(-Category)) |>
##### everything from here until later is from the old answer https://stackoverflow.com/a/76905774/4145280 #####
mutate(b = as.Date(Begin), e = as.Date(End),
# create a sequence of dates between begin and end
days = map2(b, e, ~ seq.Date(.x, .y, by = "1 day"))) |>
# unnest the days column into many rows
unnest(days) |>
# if the beginning date is the same as the date in `days`, then use the original Begin column
# else, use `days` as a datetime
mutate(Begin = if_else(b == days, Begin, as_datetime(days)),
# same with End, but subtracting one minute
End = if_else(e == days, End, as_datetime(days) + days(1) - seconds(1)), .keep = "unused") |>
#### new stuff starts here ####
mutate(Date = as.Date(Begin),
malfunction_times = End - Begin) |>
reframe(malfunction_times = round(sum(malfunction_times)),
OK_times = 24 - malfunction_times, .by = Date)
输出:
# A tibble: 4 × 3
Date malfunction_times OK_times
<date> <drtn> <drtn>
1 2023-07-15 12 hours 12 hours
2 2023-07-16 5 hours 19 hours
3 2023-08-16 10 hours 14 hours
4 2023-08-17 16 hours 8 hours
我可以建议一个更直观但效率较低的替代方案吗?主要思想是将所有间隔创建为秒向量,然后消除重复项(重叠),最后检查每天有多少秒的故障发生在这一天:
# get days as numerics
begin_day <- as.numeric(as_datetime(date(df_time$Begin)))
days <- unique(begin_day)
# get times as numerics
begin_time <- as.numeric(df_time$Begin)
end_time <- as.numeric(df_time$End)
# create intervals as vectors of seconds and make them unique
# in other words: all overlaps are merged
intervals <- Map(function(x, y) x:y, begin_time, end_time)
intervals_unique <- unique(unlist(intervals))
# now we simply check how many seconds of a day overlap with the intervals
res <- sapply(days, function(x) sum(intervals_unique >= x & intervals_unique < x + 24 * 60 * 60))
data.frame(date = as_datetime(days), malfunction = res / (60 * 60))
制作:
date malfunction
1 2023-07-15 12.051389
2 2023-07-16 4.933889
3 2023-08-16 10.484722
4 2023-08-17 16.000000
通过过滤特定日期的数据帧然后应用该过程可以提高效率。这将使
intervals_unique
向量变小。在这里,我只是想展示一下总体思路。结果与 Mark 的解决方案相符。