How can I group rows in R when their time ranges overlap (with some tolerance)?


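I have data recording people (personid) who receive different types of service (streamid) from a provider. Each service period has a start date and an end date; the end date may be missing if the service is still ongoing.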

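I need to group each person's rows into periods of continuous service, allowing a gap of up to one day between the end of one service and the start of the next. In the data below, the desired grouping has been hand-coded in the targetgroup column.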

library(tidyverse)
test <-  type_convert(tribble(
  ~personid, ~streamid, ~datetimestart, ~datetimeend, ~targetgroup,
          1,         1,   "2023-01-01", "2023-01-05",            1,
          1,         2,   "2023-01-07", "2023-01-30",            2,
                    
          2,         2,   "2023-12-01", NA_character_,           1, 
          2,         1,   "2024-01-12", "2024-01-30",            1,
          2,         3,   "2024-02-10", "2024-02-28",            1,
          2,         1,   "2024-02-25", NA_character_,           1,
                    
          3,         3,   "2023-12-01", "2024-01-14",            1, 
          3,         2,   "2024-01-12", "2024-01-30",            1,
          3,         1,   "2024-01-10", "2024-02-01",            1,
                    
          4,         3,   "2023-12-01", "2024-01-14",            1, 
          4,         2,   "2024-01-12", "2024-01-20",            1,
          4,         1,   "2024-01-21", NA_character_,           1
  
))
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   datetimestart = col_date(format = ""),
#>   datetimeend = col_date(format = "")
#> )

Created on 2024-01-23 with reprex v2.1.0

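I have been able to get about 90% of the way there using lag() or merging, but not for cases like the personid == 2 group, where the overlap is between non-consecutive rows (when ordered by start date). Here is my "best" result so far; note that it puts rows 5 and 6 into group 2, although the first service for personid 2 is still ongoing.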

test %>% 
  arrange(personid, datetimestart) %>%
  group_by(personid) %>%
  mutate(new_episode_group = datetimestart - lag(datetimeend) > days(1),
         new_episode_group = if_else(is.na(new_episode_group), FALSE, new_episode_group),
         group = cumsum(new_episode_group) + 1) %>% 
  select(-new_episode_group)
#> # A tibble: 12 × 6
#> # Groups:   personid [4]
#>    personid streamid datetimestart datetimeend targetgroup group
#>       <dbl>    <dbl> <date>        <date>            <dbl> <dbl>
#>  1        1        1 2023-01-01    2023-01-05            1     1
#>  2        1        2 2023-01-07    2023-01-30            2     2
#>  3        2        2 2023-12-01    NA                    1     1
#>  4        2        1 2024-01-12    2024-01-30            1     1
#>  5        2        3 2024-02-10    2024-02-28            1     2
#>  6        2        1 2024-02-25    NA                    1     2
#>  7        3        3 2023-12-01    2024-01-14            1     1
#>  8        3        1 2024-01-10    2024-02-01            1     1
#>  9        3        2 2024-01-12    2024-01-30            1     1
#> 10        4        3 2023-12-01    2024-01-14            1     1
#> 11        4        2 2024-01-12    2024-01-20            1     1
#> 12        4        1 2024-01-21    NA                    1     1


1 Answer

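As far as I can tell, cummax() is what you want. A running maximum of the end dates gives, for each row, the latest end date seen so far for that person, so each start date is compared against all earlier services rather than only the previous row. (Note: you need to convert the dates to numeric first, because cummax() is not defined for the Date class.)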

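test %>% 
  # treat ongoing services (missing end date) as ending far in the future;
  # note that this also overwrites the NA values in the returned datetimeend column
  replace_na(list(datetimeend = as.Date("2099-12-31"))) %>%
  # cummax() is not defined for Date, so work on the underlying day count
  mutate(datetimeend_num = as.numeric(datetimeend)) %>%
  arrange(personid, datetimestart) %>% 
  group_by(personid) %>% 
  # cummax_end = latest end date seen so far; an explicit origin keeps as.Date() working on R < 4.3
  mutate(cummax_end = as.Date(cummax(datetimeend_num), origin = "1970-01-01"), 
         new_episode_group = datetimestart - lag(cummax_end, default = cummax_end[1]) > days(1),
         group = cumsum(new_episode_group) + 1) %>% 
  ungroup() %>% 
  select(-c(datetimeend_num, cummax_end, new_episode_group))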
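
To make the running-maximum idea easier to check by hand, here is a minimal base R sketch of the same logic, applied to personid 1 and 2 from the question. The helper name episode_groups and its max_gap argument are hypothetical and introduced only for illustration. It also assumes it is acceptable to treat a missing end date as Inf instead of a placeholder date, which leaves the original datetimeend column untouched.

```r
# Rows for personid 1 and 2 from the question (NA end date = service still ongoing)
svc <- data.frame(
  personid      = c(1, 1, 2, 2, 2, 2),
  datetimestart = as.Date(c("2023-01-01", "2023-01-07",
                            "2023-12-01", "2024-01-12", "2024-02-10", "2024-02-25")),
  datetimeend   = as.Date(c("2023-01-05", "2023-01-30",
                            NA, "2024-01-30", "2024-02-28", NA))
)

# Hypothetical helper: episode number for each row of ONE person's services.
# Assumes the rows are already sorted by start date.
# max_gap is the largest gap (in days) that still counts as continuous service.
episode_groups <- function(start, end, max_gap = 1) {
  end_num <- as.numeric(end)                   # days since 1970-01-01
  end_num[is.na(end_num)] <- Inf               # an ongoing service never ends
  running_end <- cummax(end_num)               # latest end date seen so far
  prev_end <- c(Inf, head(running_end, -1))    # running end before the current row
  new_episode <- as.numeric(start) - prev_end > max_gap  # first row is always FALSE
  cumsum(new_episode) + 1
}

# Sort, then apply the helper person by person
svc <- svc[order(svc$personid, svc$datetimestart), ]
svc$group <- NA_real_
for (p in unique(svc$personid)) {
  rows <- svc$personid == p
  svc$group[rows] <- episode_groups(svc$datetimestart[rows], svc$datetimeend[rows])
}

svc$group
```

For these six rows the result is 1 2 1 1 1 1, which matches targetgroup: the two-day gap for personid 1 starts a new group, while every row for personid 2 stays in group 1 because the first service has no end date.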
