Identify and label unique spans based on a Y/N flag and a continuous tracking column [closed]

Question Votes: 0 Answers: 3

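I have a large dataset of roughly 500,000 members over a 12-year period. Based on elig_flag and Continuous_elig_counter, I need to determine whether a member's row for a given month falls before their first enrollment (pre-enrollment) or within their 1st, 2nd, 3rd, etc. enrollment/unenrollment span. The maximum number of enrollment spans in the dataset is 11, which I determined by counting the distinct number of times elig_counter = 1. The trickiest part seems to be making sure the logic follows the dates in chronological order.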

Dummy data example of the desired output:

       Month    ID Elig_Flag Continuous_Elig_Counter Enrollment_Span_Detail
 1:  1/1/2020 XX123         N                       0         Pre-Enrollment
 2:  2/1/2020 XX123         N                       0         Pre-Enrollment
 3:  3/1/2020 XX123         Y                       1       First Enrollment
 4:  4/1/2020 XX123         Y                       2       First Enrollment
 5:  5/1/2020 XX123         Y                       3       First Enrollment
 6:  6/1/2020 XX123         Y                       4       First Enrollment
 7:  7/1/2020 XX123         Y                       5       First Enrollment
 8:  8/1/2020 XX123         N                       0     First Unenrollment
 9:  9/1/2020 XX123         N                       0     First Unenrollment
10: 10/1/2020 XX123         N                       0     First Unenrollment
11: 11/1/2020 XX123         N                       0     First Unenrollment
12: 12/1/2020 XX123         Y                       1      Second Enrollment
13:  1/1/2021 XX123         Y                       2      Second Enrollment
14:  2/1/2021 XX123         Y                       3      Second Enrollment
15:  3/1/2021 XX123         Y                       4      Second Enrollment
16:  4/1/2021 XX123         Y                       5      Second Enrollment
17:  5/1/2021 XX123         Y                       6      Second Enrollment
18:  6/1/2021 XX123         Y                       7      Second Enrollment
19:  7/1/2021 XX123         N                       0    Second Unenrollment
20:  8/1/2021 XX123         Y                       1       Third Enrollment
21:  9/1/2021 XX123         Y                       2       Third Enrollment
22: 10/1/2021 XX123         Y                       3       Third Enrollment
23: 11/1/2021 XX123         Y                       4       Third Enrollment
24: 12/1/2021 XX123         Y                       5       Third Enrollment

I have used lag() combined with ifelse statements in similar situations before, but this case seems a bit harder, and so far I haven't had any luck.

r dplyr tidyverse ranking
3 Answers
1 vote

You can use consecutive_id to find the changes in elig_flag and case_match to apply the values:

library(tidyverse)

# Sample data
df <- tibble(
  month = seq(mdy("1/1/2020"), mdy("12/1/2021"), by = "month"),
  id = "XX123",
  elig_flag = c(rep("N", 2), rep("Y", 5), rep("N", 4), rep("Y", 7), rep("N", 1), rep("Y", 5))
)

df |> 
  mutate(
    group = if_else(elig_flag == "N", 0, 1),
    group = consecutive_id(group),
    enroll = case_match(
      group,
      1 ~ "Pre-Enrollment",
      2 ~ "First Enrollment",
      3 ~ "First Unenrollment",
      4 ~ "Second Enrollment",
      5 ~ "Second Unenrollment",
      6 ~ "Third Enrollment",
      .default = "Other"
    ),
    .by = id)
#> # A tibble: 24 × 5
#>    month      id    elig_flag group enroll            
#>    <date>     <chr> <chr>     <int> <chr>             
#>  1 2020-01-01 XX123 N             1 Pre-Enrollment    
#>  2 2020-02-01 XX123 N             1 Pre-Enrollment    
#>  3 2020-03-01 XX123 Y             2 First Enrollment  
#>  4 2020-04-01 XX123 Y             2 First Enrollment  
#>  5 2020-05-01 XX123 Y             2 First Enrollment  
#>  6 2020-06-01 XX123 Y             2 First Enrollment  
#>  7 2020-07-01 XX123 Y             2 First Enrollment  
#>  8 2020-08-01 XX123 N             3 First Unenrollment
#>  9 2020-09-01 XX123 N             3 First Unenrollment
#> 10 2020-10-01 XX123 N             3 First Unenrollment
#> # ℹ 14 more rows

Created on 2024-04-21 with reprex v2.1.0
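
To make the grouping step above easier to follow, here is a minimal sketch of what consecutive_id does to a flag vector on its own. The vector flag and the names run_id and base_id are made up for illustration; the rle() variant is only an assumed base R equivalent, not something from the answer.

library(dplyr)

# Hypothetical flag vector for a single member, already in date order
flag <- c("N", "N", "Y", "Y", "N", "Y")

# consecutive_id() gives every run of identical values its own integer id
run_id <- consecutive_id(flag)
run_id
#> [1] 1 1 2 2 3 4

# The same run ids built with base R's rle()
r <- rle(flag)
base_id <- rep(seq_along(r$lengths), r$lengths)
stopifnot(identical(run_id, base_id))

Note that the case_match mapping in the answer assumes run 1 is always an "N" run. For a member whose first row is already "Y", run 1 would be an enrollment, so the hard-coded labels would be shifted by one.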


1 vote

An approach that handles any number of enrollments/"unenrollments", based on the order of the dates and the variable Elig_Flag:

library(dplyr)

df %>% 
  mutate(grp = cumsum(lag(Elig_Flag, default="") != Elig_Flag & Elig_Flag == "Y"), 
         ESD = case_when(
                 grp == 0 & Elig_Flag == "N" ~ "Pre-Enrollment",
                 grp > 0 & Elig_Flag == "Y" ~ paste0(grp, ". ", "Enrollment"),
                 grp > 0 & Elig_Flag == "N" ~ paste0(grp, ". ", "Unenrollment")), 
         grp = NULL, .by = ID)

Output, showing only ID, Enrollment_Span_Detail and the new ESD column

      ID Enrollment_Span_Detail             ESD
1  XX123         Pre-Enrollment  Pre-Enrollment
2  XX123         Pre-Enrollment  Pre-Enrollment
3  XX123       First Enrollment   1. Enrollment
4  XX123       First Enrollment   1. Enrollment
5  XX123       First Enrollment   1. Enrollment
6  XX123       First Enrollment   1. Enrollment
7  XX123       First Enrollment   1. Enrollment
8  XX123     First Unenrollment 1. Unenrollment
9  XX123     First Unenrollment 1. Unenrollment
10 XX123     First Unenrollment 1. Unenrollment
11 XX123     First Unenrollment 1. Unenrollment
12 XX123      Second Enrollment   2. Enrollment
13 XX123      Second Enrollment   2. Enrollment
14 XX123      Second Enrollment   2. Enrollment
15 XX123      Second Enrollment   2. Enrollment
16 XX123      Second Enrollment   2. Enrollment
17 XX123      Second Enrollment   2. Enrollment
18 XX123      Second Enrollment   2. Enrollment
19 XX123    Second Unenrollment 2. Unenrollment
20 XX123       Third Enrollment   3. Enrollment
21 XX123       Third Enrollment   3. Enrollment
22 XX123       Third Enrollment   3. Enrollment
23 XX123       Third Enrollment   3. Enrollment
24 XX123       Third Enrollment   3. Enrollment

Data

df <- structure(list(Month = c("1/1/2020", "2/1/2020", "3/1/2020", 
"4/1/2020", "5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020", 
"10/1/2020", "11/1/2020", "12/1/2020", "1/1/2021", "2/1/2021", 
"3/1/2021", "4/1/2021", "5/1/2021", "6/1/2021", "7/1/2021", "8/1/2021", 
"9/1/2021", "10/1/2021", "11/1/2021", "12/1/2021"), ID = c("XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123"), Elig_Flag = c("N", "N", "Y", "Y", "Y", "Y", 
"Y", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N", 
"Y", "Y", "Y", "Y", "Y"), Continuous_Elig_Counter = c(0L, 0L, 
1L, 2L, 3L, 4L, 5L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
0L, 1L, 2L, 3L, 4L, 5L), Enrollment_Span_Detail = c("Pre-Enrollment", 
"Pre-Enrollment", "First Enrollment", "First Enrollment", "First Enrollment", 
"First Enrollment", "First Enrollment", "First Unenrollment", 
"First Unenrollment", "First Unenrollment", "First Unenrollment", 
"Second Enrollment", "Second Enrollment", "Second Enrollment", 
"Second Enrollment", "Second Enrollment", "Second Enrollment", 
"Second Enrollment", "Second Unenrollment", "Third Enrollment", 
"Third Enrollment", "Third Enrollment", "Third Enrollment", "Third Enrollment"
)), class = "data.frame", row.names = c(NA, -24L))
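
The question stresses chronological order, and in the toy data Month is stored as text, so the rows are only in the right order because they were typed that way. The following is a rough sketch, under the assumption that Month is always in m/d/Y form, of parsing and sorting before counting spans. It also uses Continuous_Elig_Counter == 1 as the span start, as the question describes. The data frame toy and the columns Month_date and span are hypothetical names used only here.

library(dplyr)

# Hypothetical input: same columns as the toy data, but rows out of order
toy <- data.frame(
  Month = c("3/1/2020", "1/1/2020", "2/1/2020", "5/1/2020", "4/1/2020", "6/1/2020"),
  ID = "XX123",
  Elig_Flag = c("Y", "N", "N", "N", "Y", "Y"),
  Continuous_Elig_Counter = c(1L, 0L, 0L, 0L, 2L, 1L)
)

res <- toy %>%
  # Parse the text dates; sorting the raw strings would put "10/1/2020" before "2/1/2020"
  mutate(Month_date = as.Date(Month, format = "%m/%d/%Y")) %>%
  arrange(ID, Month_date) %>%
  # A counter value of 1 marks the first month of a new enrollment span
  mutate(span = cumsum(Continuous_Elig_Counter == 1L), .by = ID) %>%
  mutate(ESD = case_when(
    span == 0 ~ "Pre-Enrollment",
    Elig_Flag == "Y" ~ paste0(span, ". Enrollment"),
    TRUE ~ paste0(span, ". Unenrollment")
  ))

res$ESD

Whether the counter or the flag change is the better span marker depends on how reliably the counter is populated in the real data; the sketch simply takes the counter at face value.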

0 votes

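My contribution: using nombre, a neat package that helps write numbers as cardinals, ordinals, adverbs, numerators, ratios, etc.

Full code (it assumes the toy data at the end of this answer has been loaded first):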

library(tidyverse)
library(nombre)

# -----------------
df <- df %>% 
  mutate(.by = c(ID), esd = trunc(consecutive_id(Elig_Flag)/2)) %>% 
  mutate(
    .by = c(ID, Elig_Flag),
    esd = case_when(
      Elig_Flag == "N" & esd == 0 ~ "Pre-Enrollment", 
      Elig_Flag == "N" & esd != 0 ~ paste0(nom_ord(esd), " Unenrollment"), # nombre::nom_ord
      Elig_Flag == "Y" ~ paste0(nom_ord(esd), " Enrollment")) %>% 
      
      str_to_title()) %>% 
  
  rename(Enrollment_Span_Detail = esd) # That's a big var name 

Output:

> df
       Month    ID Elig_Flag Continuous_Elig_Counter     expected_output Enrollment_Span_Detail
1   1/1/2020 XX123         N                       0      Pre-Enrollment         Pre-Enrollment
2   2/1/2020 XX123         N                       0      Pre-Enrollment         Pre-Enrollment
3   3/1/2020 XX123         Y                       1    First Enrollment       First Enrollment
4   4/1/2020 XX123         Y                       2    First Enrollment       First Enrollment
5   5/1/2020 XX123         Y                       3    First Enrollment       First Enrollment
6   6/1/2020 XX123         Y                       4    First Enrollment       First Enrollment
7   7/1/2020 XX123         Y                       5    First Enrollment       First Enrollment
8   8/1/2020 XX123         N                       0  First Unenrollment     First Unenrollment
9   9/1/2020 XX123         N                       0  First Unenrollment     First Unenrollment
10 10/1/2020 XX123         N                       0  First Unenrollment     First Unenrollment
11 11/1/2020 XX123         N                       0  First Unenrollment     First Unenrollment
12 12/1/2020 XX123         Y                       1   Second Enrollment      Second Enrollment
13  1/1/2021 XX123         Y                       2   Second Enrollment      Second Enrollment
14  2/1/2021 XX123         Y                       3   Second Enrollment      Second Enrollment
15  3/1/2021 XX123         Y                       4   Second Enrollment      Second Enrollment
16  4/1/2021 XX123         Y                       5   Second Enrollment      Second Enrollment
17  5/1/2021 XX123         Y                       6   Second Enrollment      Second Enrollment
18  6/1/2021 XX123         Y                       7   Second Enrollment      Second Enrollment
19  7/1/2021 XX123         N                       0 Second Unenrollment    Second Unenrollment
20  8/1/2021 XX123         Y                       1    Third Enrollment       Third Enrollment
21  9/1/2021 XX123         Y                       2    Third Enrollment       Third Enrollment
22 10/1/2021 XX123         Y                       3    Third Enrollment       Third Enrollment
23 11/1/2021 XX123         Y                       4    Third Enrollment       Third Enrollment
24 12/1/2021 XX123         Y                       5    Third Enrollment       Third Enrollment

Toy data provided by @Andre_Wildberd

df <- structure(list(Month = c("1/1/2020", "2/1/2020", "3/1/2020", 
"4/1/2020", "5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020", 
"10/1/2020", "11/1/2020", "12/1/2020", "1/1/2021", "2/1/2021", 
"3/1/2021", "4/1/2021", "5/1/2021", "6/1/2021", "7/1/2021", "8/1/2021", 
"9/1/2021", "10/1/2021", "11/1/2021", "12/1/2021"), ID = c("XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123", 
"XX123", "XX123"), Elig_Flag = c("N", "N", "Y", "Y", "Y", "Y", 
"Y", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N", 
"Y", "Y", "Y", "Y", "Y"), Continuous_Elig_Counter = c(0L, 0L, 
1L, 2L, 3L, 4L, 5L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
0L, 1L, 2L, 3L, 4L, 5L), Enrollment_Span_Detail = c("Pre-Enrollment", 
"Pre-Enrollment", "First Enrollment", "First Enrollment", "First Enrollment", 
"First Enrollment", "First Enrollment", "First Unenrollment", 
"First Unenrollment", "First Unenrollment", "First Unenrollment", 
"Second Enrollment", "Second Enrollment", "Second Enrollment", 
"Second Enrollment", "Second Enrollment", "Second Enrollment", 
"Second Enrollment", "Second Unenrollment", "Third Enrollment", 
"Third Enrollment", "Third Enrollment", "Third Enrollment", "Third Enrollment"
)), class = "data.frame", row.names = c(NA, -24L))

df <- rename(df, expected_output = Enrollment_Span_Detail)
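
If adding a package just for the ordinal words is not desirable, the same labels can be sketched with a fixed lookup vector, since the question states that the data has at most 11 spans. This is only an illustration: ordinals and label_spans are hypothetical names, and spans beyond the eleventh would come out as NA labels.

library(dplyr)

# Ordinal words up to the maximum span count mentioned in the question
ordinals <- c("First", "Second", "Third", "Fourth", "Fifth", "Sixth",
              "Seventh", "Eighth", "Ninth", "Tenth", "Eleventh")

# flag: one member's Elig_Flag values, already sorted by date
label_spans <- function(flag) {
  # A span starts at every "Y" that follows an "N" (or opens the series)
  starts <- flag == "Y" & lag(flag, default = "N") == "N"
  span <- cumsum(starts)
  # pmax() avoids indexing with 0; rows with span 0 are overwritten below
  word <- ordinals[pmax(span, 1L)]
  ifelse(span == 0L,
         "Pre-Enrollment",
         paste(word, ifelse(flag == "Y", "Enrollment", "Unenrollment")))
}

label_spans(c("N", "Y", "Y", "N", "Y"))
label_spans(c("Y", "N", "Y"))   # a member who is enrolled from the first row

Applied to the toy data, this would be used per member, for example mutate(df, esd = label_spans(Elig_Flag), .by = ID).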
