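I have a large dataset with roughly 500,000 members over a 12-year period. Based on elig_flag and Continuous_elig_counter, I need to determine whether a member's row for a given month falls before their first enrollment (pre-enrollment) or within their 1st, 2nd, 3rd, etc. enrollment/unenrollment span. The maximum number of enrollment spans in the dataset is 11, which I determined by counting the distinct number of times elig_counter = 1. The trickiest part seems to be making sure the logic follows the dates in chronological order.
Dummy data example of the desired output: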
Month ID Elig_Flag Continuous_Elig_Counter Enrollment_Span_Detail
1: 1/1/2020 XX123 N 0 Pre-Enrollment
2: 2/1/2020 XX123 N 0 Pre-Enrollment
3: 3/1/2020 XX123 Y 1 First Enrollment
4: 4/1/2020 XX123 Y 2 First Enrollment
5: 5/1/2020 XX123 Y 3 First Enrollment
6: 6/1/2020 XX123 Y 4 First Enrollment
7: 7/1/2020 XX123 Y 5 First Enrollment
8: 8/1/2020 XX123 N 0 First Unenrollment
9: 9/1/2020 XX123 N 0 First Unenrollment
10: 10/1/2020 XX123 N 0 First Unenrollment
11: 11/1/2020 XX123 N 0 First Unenrollment
12: 12/1/2020 XX123 Y 1 Second Enrollment
13: 1/1/2021 XX123 Y 2 Second Enrollment
14: 2/1/2021 XX123 Y 3 Second Enrollment
15: 3/1/2021 XX123 Y 4 Second Enrollment
16: 4/1/2021 XX123 Y 5 Second Enrollment
17: 5/1/2021 XX123 Y 6 Second Enrollment
18: 6/1/2021 XX123 Y 7 Second Enrollment
19: 7/1/2021 XX123 N 0 Second Unenrollment
20: 8/1/2021 XX123 Y 1 Third Enrollment
21: 9/1/2021 XX123 Y 2 Third Enrollment
22: 10/1/2021 XX123 Y 3 Third Enrollment
23: 11/1/2021 XX123 Y 4 Third Enrollment
24: 12/1/2021 XX123 Y 5 Third Enrollment
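I have used lag() combined with ifelse statements in similar situations before, but this case seems a bit harder and I haven't had any luck so far.
You can use consecutive_id to find changes in elig_flag and case_match to apply the values: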
library(tidyverse)
# Sample data
df <- tibble(
  month = seq(mdy("1/1/2020"), mdy("12/1/2021"), by = "month"),
  id = "XX123",
  elig_flag = c(rep("N", 2), rep("Y", 5), rep("N", 4), rep("Y", 7), rep("N", 1), rep("Y", 5))
)
df |>
  mutate(
    group = if_else(elig_flag == "N", 0, 1),
    group = consecutive_id(group),
    enroll = case_match(
      group,
      1 ~ "Pre-Enrollment",
      2 ~ "First Enrollment",
      3 ~ "First Unenrollment",
      4 ~ "Second Enrollment",
      5 ~ "Second Unenrollment",
      6 ~ "Third Enrollment",
      .default = "Other"
    ),
    .by = id)
#> # A tibble: 24 × 5
#> month id elig_flag group enroll
#> <date> <chr> <chr> <int> <chr>
#> 1 2020-01-01 XX123 N 1 Pre-Enrollment
#> 2 2020-02-01 XX123 N 1 Pre-Enrollment
#> 3 2020-03-01 XX123 Y 2 First Enrollment
#> 4 2020-04-01 XX123 Y 2 First Enrollment
#> 5 2020-05-01 XX123 Y 2 First Enrollment
#> 6 2020-06-01 XX123 Y 2 First Enrollment
#> 7 2020-07-01 XX123 Y 2 First Enrollment
#> 8 2020-08-01 XX123 N 3 First Unenrollment
#> 9 2020-09-01 XX123 N 3 First Unenrollment
#> 10 2020-10-01 XX123 N 3 First Unenrollment
#> # ℹ 14 more rows
Created on 2024-04-21 with reprex v2.1.0
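Since the question stresses chronological order, note that consecutive_id works on row order, so the rows need to be sorted by date within each member first. A minimal sketch of that step, assuming dplyr 1.1.0 or later; the two-member data and the names members, shuffled and labelled are made up for illustration:

```r
library(dplyr)

# Hypothetical two-member data (not from the question)
members <- tibble(
  id        = rep(c("A", "B"), each = 4),
  month     = rep(as.Date(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01")), 2),
  elig_flag = c("N", "Y", "Y", "N",   # member A
                "Y", "N", "N", "Y")   # member B
)

# Deliberately shuffle the rows out of date order
shuffled <- members[c(8, 3, 1, 6, 2, 7, 4, 5), ]

labelled <- shuffled |>
  arrange(id, month) |>                                 # chronological order within each member
  mutate(group = consecutive_id(elig_flag), .by = id)   # run number restarts for each member

labelled
```

If the month column is stored as text such as "1/1/2020", it would have to be parsed to a Date first (for example with as.Date(x, format = "%m/%d/%Y")), because sorting the strings does not give calendar order.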
A way to label any number of enrollments/"unenrollments", based on the date order of the rows and the variable Elig_Flag:
library(dplyr)
df %>%
  mutate(
    # grp increases by one at the first "Y" row of each enrollment run
    grp = cumsum(lag(Elig_Flag, default = "") != Elig_Flag & Elig_Flag == "Y"),
    ESD = case_when(
      grp == 0 & Elig_Flag == "N" ~ "Pre-Enrollment",
      grp > 0 & Elig_Flag == "Y" ~ paste0(grp, ". ", "Enrollment"),
      grp > 0 & Elig_Flag == "N" ~ paste0(grp, ". ", "Unenrollment")),
    grp = NULL, .by = ID)
Output, showing only ID and Enrollment_Span_Detail alongside the new ESD column
ID Enrollment_Span_Detail ESD
1 XX123 Pre-Enrollment Pre-Enrollment
2 XX123 Pre-Enrollment Pre-Enrollment
3 XX123 First Enrollment 1. Enrollment
4 XX123 First Enrollment 1. Enrollment
5 XX123 First Enrollment 1. Enrollment
6 XX123 First Enrollment 1. Enrollment
7 XX123 First Enrollment 1. Enrollment
8 XX123 First Unenrollment 1. Unenrollment
9 XX123 First Unenrollment 1. Unenrollment
10 XX123 First Unenrollment 1. Unenrollment
11 XX123 First Unenrollment 1. Unenrollment
12 XX123 Second Enrollment 2. Enrollment
13 XX123 Second Enrollment 2. Enrollment
14 XX123 Second Enrollment 2. Enrollment
15 XX123 Second Enrollment 2. Enrollment
16 XX123 Second Enrollment 2. Enrollment
17 XX123 Second Enrollment 2. Enrollment
18 XX123 Second Enrollment 2. Enrollment
19 XX123 Second Unenrollment 2. Unenrollment
20 XX123 Third Enrollment 3. Enrollment
21 XX123 Third Enrollment 3. Enrollment
22 XX123 Third Enrollment 3. Enrollment
23 XX123 Third Enrollment 3. Enrollment
24 XX123 Third Enrollment 3. Enrollment
df <- structure(list(Month = c("1/1/2020", "2/1/2020", "3/1/2020",
"4/1/2020", "5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020",
"10/1/2020", "11/1/2020", "12/1/2020", "1/1/2021", "2/1/2021",
"3/1/2021", "4/1/2021", "5/1/2021", "6/1/2021", "7/1/2021", "8/1/2021",
"9/1/2021", "10/1/2021", "11/1/2021", "12/1/2021"), ID = c("XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123"), Elig_Flag = c("N", "N", "Y", "Y", "Y", "Y",
"Y", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N",
"Y", "Y", "Y", "Y", "Y"), Continuous_Elig_Counter = c(0L, 0L,
1L, 2L, 3L, 4L, 5L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
0L, 1L, 2L, 3L, 4L, 5L), Enrollment_Span_Detail = c("Pre-Enrollment",
"Pre-Enrollment", "First Enrollment", "First Enrollment", "First Enrollment",
"First Enrollment", "First Enrollment", "First Unenrollment",
"First Unenrollment", "First Unenrollment", "First Unenrollment",
"Second Enrollment", "Second Enrollment", "Second Enrollment",
"Second Enrollment", "Second Enrollment", "Second Enrollment",
"Second Enrollment", "Second Unenrollment", "Third Enrollment",
"Third Enrollment", "Third Enrollment", "Third Enrollment", "Third Enrollment"
)), class = "data.frame", row.names = c(NA, -24L))
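If word labels such as "First Enrollment" are wanted instead of "1. Enrollment", one option that needs no extra package is a lookup vector, since the question reports at most 11 spans. This is only a sketch under that assumption; the ordinals vector and the small toy data frame are hypothetical, and the rows are assumed to be in date order already:

```r
library(dplyr)

# Ordinal words for up to 11 spans (the maximum reported in the question)
ordinals <- c("First", "Second", "Third", "Fourth", "Fifth", "Sixth",
              "Seventh", "Eighth", "Ninth", "Tenth", "Eleventh")

# Small hypothetical example: one member, rows already in date order
toy <- data.frame(
  ID        = "XX123",
  Elig_Flag = c("N", "Y", "Y", "N", "Y")
)

toy_labelled <- toy %>%
  mutate(
    # counter increases by one at the first "Y" row of each enrollment run
    grp = cumsum(lag(Elig_Flag, default = "") != Elig_Flag & Elig_Flag == "Y"),
    ESD = case_when(
      grp == 0 ~ "Pre-Enrollment",
      # pmax() keeps the index valid for grp == 0 rows, which the first case already handled
      Elig_Flag == "Y" ~ paste(ordinals[pmax(grp, 1L)], "Enrollment"),
      TRUE ~ paste(ordinals[pmax(grp, 1L)], "Unenrollment")
    ),
    grp = NULL,
    .by = ID
  )

toy_labelled
```

A member with more than 11 spans would get NA as the ordinal word, so the lookup vector would need extending in that case.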
My contribution, using nombre! Full snippet:
library(tidyverse)
library(nombre)
# -----------------
df <- df %>%
  # Halve the run number so each enrollment run and the unenrollment run after it share an index
  mutate(.by = c(ID), esd = trunc(consecutive_id(Elig_Flag)/2)) %>%
  mutate(
    .by = c(ID, Elig_Flag),
    esd = case_when(
      Elig_Flag == "N" & esd == 0 ~ "Pre-Enrollment",
      Elig_Flag == "N" & esd != 0 ~ paste0(nom_ord(esd), " Unenrollment"), # nombre::nom_ord
      Elig_Flag == "Y" ~ paste0(nom_ord(esd), " Enrollment")) %>%
      str_to_title()) %>%
  rename(Enrollment_Span_Detail = esd) # That's a big var name
Output:
> df
Month ID Elig_Flag Continuous_Elig_Counter expected_output Enrollment_Span_Detail
1 1/1/2020 XX123 N 0 Pre-Enrollment Pre-Enrollment
2 2/1/2020 XX123 N 0 Pre-Enrollment Pre-Enrollment
3 3/1/2020 XX123 Y 1 First Enrollment First Enrollment
4 4/1/2020 XX123 Y 2 First Enrollment First Enrollment
5 5/1/2020 XX123 Y 3 First Enrollment First Enrollment
6 6/1/2020 XX123 Y 4 First Enrollment First Enrollment
7 7/1/2020 XX123 Y 5 First Enrollment First Enrollment
8 8/1/2020 XX123 N 0 First Unenrollment First Unenrollment
9 9/1/2020 XX123 N 0 First Unenrollment First Unenrollment
10 10/1/2020 XX123 N 0 First Unenrollment First Unenrollment
11 11/1/2020 XX123 N 0 First Unenrollment First Unenrollment
12 12/1/2020 XX123 Y 1 Second Enrollment Second Enrollment
13 1/1/2021 XX123 Y 2 Second Enrollment Second Enrollment
14 2/1/2021 XX123 Y 3 Second Enrollment Second Enrollment
15 3/1/2021 XX123 Y 4 Second Enrollment Second Enrollment
16 4/1/2021 XX123 Y 5 Second Enrollment Second Enrollment
17 5/1/2021 XX123 Y 6 Second Enrollment Second Enrollment
18 6/1/2021 XX123 Y 7 Second Enrollment Second Enrollment
19 7/1/2021 XX123 N 0 Second Unenrollment Second Unenrollment
20 8/1/2021 XX123 Y 1 Third Enrollment Third Enrollment
21 9/1/2021 XX123 Y 2 Third Enrollment Third Enrollment
22 10/1/2021 XX123 Y 3 Third Enrollment Third Enrollment
23 11/1/2021 XX123 Y 4 Third Enrollment Third Enrollment
24 12/1/2021 XX123 Y 5 Third Enrollment Third Enrollment
Toy data provided by @Andre_Wildberd
df <- structure(list(Month = c("1/1/2020", "2/1/2020", "3/1/2020",
"4/1/2020", "5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020",
"10/1/2020", "11/1/2020", "12/1/2020", "1/1/2021", "2/1/2021",
"3/1/2021", "4/1/2021", "5/1/2021", "6/1/2021", "7/1/2021", "8/1/2021",
"9/1/2021", "10/1/2021", "11/1/2021", "12/1/2021"), ID = c("XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123", "XX123", "XX123", "XX123", "XX123", "XX123",
"XX123", "XX123"), Elig_Flag = c("N", "N", "Y", "Y", "Y", "Y",
"Y", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "N",
"Y", "Y", "Y", "Y", "Y"), Continuous_Elig_Counter = c(0L, 0L,
1L, 2L, 3L, 4L, 5L, 0L, 0L, 0L, 0L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
0L, 1L, 2L, 3L, 4L, 5L), Enrollment_Span_Detail = c("Pre-Enrollment",
"Pre-Enrollment", "First Enrollment", "First Enrollment", "First Enrollment",
"First Enrollment", "First Enrollment", "First Unenrollment",
"First Unenrollment", "First Unenrollment", "First Unenrollment",
"Second Enrollment", "Second Enrollment", "Second Enrollment",
"Second Enrollment", "Second Enrollment", "Second Enrollment",
"Second Enrollment", "Second Unenrollment", "Third Enrollment",
"Third Enrollment", "Third Enrollment", "Third Enrollment", "Third Enrollment"
)), class = "data.frame", row.names = c(NA, -24L))
df <- rename(df, expected_output = Enrollment_Span_Detail)
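One thing to keep in mind: halving consecutive_id gives the right span number only when a member's first row is "N", as in the toy data. The sketch below uses a hypothetical member who is already enrolled in the first observed month (the ID "YY456" and the column names halved and counted are made up) and compares the halved index with a count of enrollment starts:

```r
library(dplyr)

# Hypothetical member whose first observed month is already enrolled
early <- data.frame(
  ID        = "YY456",
  Elig_Flag = c("Y", "Y", "N", "Y")
)

spans <- early %>%
  mutate(
    .by = ID,
    # run number halved, as in the snippet above
    halved  = trunc(consecutive_id(Elig_Flag) / 2),
    # alternative: count the first "Y" row of each enrollment run
    counted = cumsum(Elig_Flag == "Y" & lag(Elig_Flag, default = "N") != "Y")
  )

spans
```

Here halved starts at 0 for the first enrollment, while counted starts at 1, so counting enrollment starts may be the safer index if some members have no pre-enrollment rows.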