I have the following data frame df (dput below):
> df
group from to
1 A 2023-03-01 2023-03-02
2 A 2023-03-01 2023-03-03
3 A 2023-03-03 2023-03-07
4 A 2023-03-05 2023-03-08
5 A 2023-03-09 2023-03-10
6 A 2023-03-11 2023-03-11
7 B 2023-03-01 2023-03-02
8 B 2023-03-04 2023-03-06
9 B 2023-03-07 2023-03-07
10 B 2023-03-08 2023-03-11
11 B 2023-03-10 2023-03-12
12 B 2023-03-15 2023-03-16
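I would like to count the number of overlapping date intervals per group, based on the from and to columns. In group A, rows 1 and 2 overlap, and row 3 overlaps with both row 2 and row 4, so group A has 3 overlapping intervals in total. In group B, only rows 10 and 11 overlap. So I would like the following output: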
group overlaying_intervals
1 A 3
2 B 1
So I was wondering if anyone knows how to count the number of overlapping date intervals per group?
dput of df:
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), from = c("2023-03-01", "2023-03-01", "2023-03-03",
"2023-03-05", "2023-03-09", "2023-03-11", "2023-03-01", "2023-03-04",
"2023-03-07", "2023-03-08", "2023-03-10", "2023-03-15"), to = c("2023-03-02",
"2023-03-03", "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-11",
"2023-03-02", "2023-03-06", "2023-03-07", "2023-03-11", "2023-03-12",
"2023-03-16")), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
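It feels like there should be a more elegant way to do this, but my first inclination is to count all overlapping intervals, then account for each interval's overlap with itself and for each pairwise overlap being counted twice.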
library(lubridate)
library(dplyr)
library(purrr)
df %>%
  group_by(group) %>%
  mutate(int = interval(from, to),
         # count overlapping intervals, subtracting overlap with self
         overlays = (map_int(int, ~sum(int_overlaps(.x, int))))-1) %>%
  # divide total by 2 since each pairwise overlap is counted twice
  summarize(overlaying_intervals = sum(overlays)/2)
#> # A tibble: 2 × 2
#> group overlaying_intervals
#> <chr> <dbl>
#> 1 A 3
#> 2 B 1
Created on 2023-03-31 with reprex v2.0.2
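The self-overlap subtraction and the division by two can also be sidestepped by testing each unordered pair of rows only once. The following is a minimal base R sketch of that idea, not the answerer's method: it treats the intervals as closed, so two intervals overlap when each one starts on or before the day the other ends. The helper name count_overlapping_pairs is hypothetical, and the data are retyped from the question.

```r
# Data retyped from the question (closed intervals [from, to]).
df <- data.frame(
  group = rep(c("A", "B"), each = 6),
  from = c("2023-03-01", "2023-03-01", "2023-03-03", "2023-03-05",
           "2023-03-09", "2023-03-11", "2023-03-01", "2023-03-04",
           "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-15"),
  to = c("2023-03-02", "2023-03-03", "2023-03-07", "2023-03-08",
         "2023-03-10", "2023-03-11", "2023-03-02", "2023-03-06",
         "2023-03-07", "2023-03-11", "2023-03-12", "2023-03-16")
)

# Hypothetical helper (an illustration, not part of any package):
# count the unordered pairs of closed intervals that share at least one day.
count_overlapping_pairs <- function(from, to) {
  from <- as.numeric(as.Date(from))
  to <- as.numeric(as.Date(to))
  # starts_before[i, j] is TRUE when interval i starts on or before the end of j
  starts_before <- outer(from, to, `<=`)
  # i and j overlap when the condition holds in both directions
  ovl <- starts_before & t(starts_before)
  # the strict upper triangle holds each pair (i < j) once, without the diagonal
  sum(ovl[upper.tri(ovl)])
}

res <- sapply(split(df, df$group),
              function(g) count_overlapping_pairs(g$from, g$to))
res
```

On the question's data this yields 3 for group A and 1 for group B. Because it builds an n-by-n matrix per group, it is only a reasonable choice for groups of modest size.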
A base R approach.
by(df, df$group, \(x) {
  dc <- c("from", "to")
  x[dc] <- lapply(x[dc], \(x) as.numeric(as.Date(x)))
  U <- apply(x[dc], 1, \(z) z[1]:z[2])
  outer(U, U, Vectorize(\(x, y) length(intersect(x, y)) > 0)) |>
    `diag<-`(0) |>
    sum() |>
    base::`/`(2)
}) |> as.table() |> as.data.frame()
# df.group Freq
# 1 A 3
# 2 B 1
Created on 2023-03-31
Data:
df <- structure(list(group = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), from = c("2023-03-01", "2023-03-01", "2023-03-03",
"2023-03-05", "2023-03-09", "2023-03-11", "2023-03-01", "2023-03-04",
"2023-03-07", "2023-03-08", "2023-03-10", "2023-03-15"), to = c("2023-03-02",
"2023-03-03", "2023-03-07", "2023-03-08", "2023-03-10", "2023-03-11",
"2023-03-02", "2023-03-06", "2023-03-07", "2023-03-11", "2023-03-12",
"2023-03-16")), class = "data.frame", row.names = c("1", "2",
"3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
Here is a data.table option using foverlaps:
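library(data.table)
setDT(df)
rev(
  stack(
    lapply(
      split(
        setkey(df[, lapply(.SD, as.IDate), group], from, to),
        by = "group"
      ),
      function(x) {
        # the self-join returns every overlapping pair in both orders, plus
        # each row with itself; xid < yid keeps each pair exactly once
        foverlaps(x, x, which = TRUE)[xid < yid, .N]
      }
    )
  )
)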
which gives
ind values
1 A 3
2 B 1
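I think @Seth has the right idea, but you can build on it by using ivs::iv_count_overlaps() to compute all of the overlaps at once, which is more efficient than iterating row by row.
ivs is a package designed specifically for working with intervals, so it is a good fit for this.
The main thing to know about ivs is that intervals are half-open, i.e. [ ), so you need to add 1 to your to dates.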
library(dplyr, warn.conflicts = FALSE)
library(ivs)
df <- tibble::tribble(
  ~group, ~from, ~to,
  "A", "2023-03-01", "2023-03-02",
  "A", "2023-03-01", "2023-03-03",
  "A", "2023-03-03", "2023-03-07",
  "A", "2023-03-05", "2023-03-08",
  "A", "2023-03-09", "2023-03-10",
  "A", "2023-03-11", "2023-03-11",
  "B", "2023-03-01", "2023-03-02",
  "B", "2023-03-04", "2023-03-06",
  "B", "2023-03-07", "2023-03-07",
  "B", "2023-03-08", "2023-03-11",
  "B", "2023-03-10", "2023-03-12",
  "B", "2023-03-15", "2023-03-16"
)
df <- df %>%
  mutate(from = as.Date(from), to = as.Date(to)) %>%
  mutate(range = iv(from, to + 1L), .keep = "unused")
df
#> # A tibble: 12 × 2
#> group range
#> <chr> <iv<date>>
#> 1 A [2023-03-01, 2023-03-03)
#> 2 A [2023-03-01, 2023-03-04)
#> 3 A [2023-03-03, 2023-03-08)
#> 4 A [2023-03-05, 2023-03-09)
#> 5 A [2023-03-09, 2023-03-11)
#> 6 A [2023-03-11, 2023-03-12)
#> 7 B [2023-03-01, 2023-03-03)
#> 8 B [2023-03-04, 2023-03-07)
#> 9 B [2023-03-07, 2023-03-08)
#> 10 B [2023-03-08, 2023-03-12)
#> 11 B [2023-03-10, 2023-03-13)
#> 12 B [2023-03-15, 2023-03-17)
# Count all overlaps, then:
# - Subtract 1 for self-overlaps
# - Divide by 2 to get rid of doubly counted pairwise overlaps
df %>%
  mutate(count = iv_count_overlaps(range, range), .by = group) %>%
  mutate(count = count - 1L) %>%
  summarise(count = sum(count) / 2, .by = group)
#> # A tibble: 2 × 2
#> group count
#> <chr> <dbl>
#> 1 A 3
#> 2 B 1
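To see why the to dates need the extra day, here is a small base R sketch that mimics the half-open convention with plain comparisons. It does not call ivs, and the helper half_open_overlap is hypothetical. Rows 2 and 3 of group A share only the single day 2023-03-03, so they are a useful edge case.

```r
# Rows 2 and 3 of group A: they share exactly one day, 2023-03-03.
a_from <- as.Date("2023-03-01"); a_to <- as.Date("2023-03-03")
b_from <- as.Date("2023-03-03"); b_to <- as.Date("2023-03-07")

# Closed intervals [from, to]: each must start on or before the other's end.
closed_overlap <- a_from <= b_to && b_from <= a_to

# Hypothetical helper: half-open intervals [start, end) overlap only when
# each starts strictly before the other ends.
half_open_overlap <- function(s1, e1, s2, e2) s1 < e2 && s2 < e1

# Using the raw to dates as exclusive ends drops the shared day ...
raw_overlap <- half_open_overlap(a_from, a_to, b_from, b_to)
# ... while shifting each end by one day recovers the closed-interval result.
shifted_overlap <- half_open_overlap(a_from, a_to + 1, b_from, b_to + 1)

c(closed = closed_overlap, raw = raw_overlap, shifted = shifted_overlap)
```

Here closed and shifted are TRUE while raw is FALSE, which is the pair that would be missed without the + 1L. Shifting also makes a single-day row such as 2023-03-11 to 2023-03-11 a non-empty range.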