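I have data recording people (personid) who receive different types of service (streamid) from a provider. Each service period has a start date and an end date; the end date may be missing if the service is ongoing.
I need to group each person's rows into periods of continuous service, allowing a gap of one day between the end of one service and the start of the next. In the data below, the target grouping has been hand-coded into the table (targetgroup).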
library(tidyverse)
test <- type_convert(tribble(
~personid, ~streamid, ~datetimestart, ~datetimeend, ~targetgroup,
1, 1, "2023-01-01", "2023-01-05", 1,
1, 2, "2023-01-07", "2023-01-30", 2,
2, 2, "2023-12-01", NA_character_, 1,
2, 1, "2024-01-12", "2024-01-30", 1,
2, 3, "2024-02-10", "2024-02-28", 1,
2, 1, "2024-02-25", NA_character_, 1,
3, 3, "2023-12-01", "2024-01-14", 1,
3, 2, "2024-01-12", "2024-01-30", 1,
3, 1, "2024-01-10", "2024-02-01", 1,
4, 3, "2023-12-01", "2024-01-14", 1,
4, 2, "2024-01-12", "2024-01-20", 1,
4, 1, "2024-01-21", NA_character_, 1
))
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> datetimestart = col_date(format = ""),
#> datetimeend = col_date(format = "")
#> )
Created on 2024-01-23 with reprex v2.1.0
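I have been able to get about 90% of the way there using lag or coalesce, but this breaks down in cases like the personid == 2 group, where the overlap is between rows that are not consecutive (when ordered by start date). Here is an example of my "best" result so far.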
test %>%
  arrange(personid, datetimestart) %>%
  group_by(personid) %>%
  mutate(new_episode_group = datetimestart - lag(datetimeend) > days(1),
         new_episode_group = if_else(is.na(new_episode_group), FALSE, new_episode_group),
         group = cumsum(new_episode_group) + 1) %>%
  select(-new_episode_group)
#> # A tibble: 12 × 6
#> # Groups: personid [4]
#> personid streamid datetimestart datetimeend targetgroup group
#> <dbl> <dbl> <date> <date> <dbl> <dbl>
#> 1 1 1 2023-01-01 2023-01-05 1 1
#> 2 1 2 2023-01-07 2023-01-30 2 2
#> 3 2 2 2023-12-01 NA 1 1
#> 4 2 1 2024-01-12 2024-01-30 1 1
#> 5 2 3 2024-02-10 2024-02-28 1 2
#> 6 2 1 2024-02-25 NA 1 2
#> 7 3 3 2023-12-01 2024-01-14 1 1
#> 8 3 1 2024-01-10 2024-02-01 1 1
#> 9 3 2 2024-01-12 2024-01-30 1 1
#> 10 4 3 2023-12-01 2024-01-14 1 1
#> 11 4 2 2024-01-12 2024-01-20 1 1
#> 12 4 1 2024-01-21 NA 1 1
Created on 2024-01-23 with reprex v2.1.0
As far as I can tell, cummax is what you want.
(Note: you need to convert the date values to numeric, because cummax cannot handle the Date class.)
test %>%
  replace_na(list(datetimeend = as.Date("2099-12-31"))) %>% # fill open-ended services with a far-future date
  mutate(datetimeend_num = as.numeric(datetimeend)) %>%     # cummax() cannot handle Date, so use the numeric value
  arrange(personid, datetimestart) %>%
  group_by(personid) %>%
  mutate(cummax_end = as.Date(cummax(datetimeend_num), origin = "1970-01-01"), # explicit origin is required before R 4.3
         # compare each start with the latest end date seen on any earlier row
         new_episode_group = datetimestart - lag(cummax_end, default = cummax_end[1]) > days(1),
         group = cumsum(new_episode_group) + 1) %>%
  ungroup() %>%
  select(-c(datetimeend_num, cummax_end, new_episode_group))
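To see why the running maximum helps, here is a minimal base R sketch that walks through the personid == 2 rows from the question by hand. The variable names (running_end, prev_end, gap_days) are only for illustration, and the 2099-12-31 sentinel is the same assumed "far future" placeholder as above. Because the first service never ends, the running maximum of the end dates stays at the sentinel, so every later start falls inside an open episode, even though the rows that overlap are not adjacent.

```r
# Rows for personid == 2, already sorted by start date.
start <- as.Date(c("2023-12-01", "2024-01-12", "2024-02-10", "2024-02-25"))
end   <- as.Date(c(NA, "2024-01-30", "2024-02-28", NA))

# Treat an ongoing service as ending far in the future (assumed sentinel).
end[is.na(end)] <- as.Date("2099-12-31")

# Latest end date seen so far; cummax() needs numeric input.
running_end <- as.Date(cummax(as.numeric(end)), origin = "1970-01-01")

# Shift by one row so each start is compared with the earlier rows only.
# The first row is compared with its own end date, so it never starts a new group.
prev_end <- c(running_end[1], head(running_end, -1))

# A new episode starts only when the gap exceeds one day.
gap_days    <- as.numeric(start - prev_end)
new_episode <- gap_days > 1
group       <- cumsum(new_episode) + 1
print(group)
```

With a plain lag(datetimeend), the third row would be compared only with the second row's end date (2024-01-30) and would wrongly start a new group; comparing against the running maximum keeps all four rows in group 1, matching targetgroup.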