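I'm having trouble writing code that captures the first observation in every 30-day window, measured from the previously kept observation. In other words, the 30-day window resets at the first new observation that falls 30+ days after the last kept one. This needs to be done per grouping ID. I found this hard to explain, so I wrote an example dataset with a variable that marks which rows I want to keep and which to delete. The code I'm currently experimenting with is included as well.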
df <- data.frame(id = c("a","a","a","a","a",'a',"b","b","b","b","b","b"),
                 date = c('12/01/22','12/15/22','01/02/22','02/03/22','02/17/22','04/15/22',
                          '12/01/22','02/02/22','03/15/22','03/31/22','04/15/22','05/31/22'),
                 keep = c('keep','delete','keep','keep','delete','keep',
                          'keep','keep','keep','delete','keep','keep'))
cutoff <- 30
df.t <- df %>%
  # if date is < cutoff days of first date, maintain the same group
  # else create a new group
  group_by(g= accumulate(date, ~ if (.y - .x < cutoff) .x else .y)) %>%
  # for each group select the first row
  slice_head(n = 1) %>%
  # ungroup and remove grouping variable
  ungroup()
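Maybe others will, but I don't see a way around a loop. Essentially, the function my_f() loops through the dates, finds any later dates that fall within 30 days of the current one, and removes them. I use split() to make a list of data frames by id, then apply the function to each data frame separately. If you want, this also lets you take advantage of multiple cores via parallel::mclapply().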
library(dplyr)

df <- data.frame(id = c("a","a","a","a","a",'a',"b","b","b","b","b","b"),
                 date = c('12/01/22','12/15/22','01/02/22','02/03/22','02/17/22','04/15/22',
                          '12/01/22','02/02/22','03/15/22','03/31/22','04/15/22','05/31/22'),
                 keep = c('keep','delete','keep','keep','delete','keep',
                          'keep','keep','keep','delete','keep','keep'))

# Parse the dates and sort them within each id
df <- df %>%
  mutate(date = as.Date(date, format="%m/%d/%y")) %>%
  arrange(id, date)

# One data frame per id
sp_df <- split(df, df$id)

my_f <- function(x, cutoff=30){
  # Start at row 1 so the first observation also acts as an anchor
  j <- 1
  while(j < nrow(x)){
    # Keep rows up to and including row j, plus rows more than
    # `cutoff` days after it; later rows inside the window are dropped
    x <- x %>% filter((date - date[j]) <= 0 | (date - date[j]) > cutoff)
    j <- j+1
  }
  x
}

bind_rows(
  lapply(sp_df, my_f),
  .id="id")
#> id date keep
#> 1 a 2022-01-02 keep
#> 2 a 2022-02-03 keep
#> 3 a 2022-04-15 keep
#> 4 a 2022-12-01 keep
#> 5 b 2022-02-02 keep
#> 6 b 2022-03-15 keep
#> 7 b 2022-04-15 keep
#> 8 b 2022-05-31 keep
#> 9 b 2022-12-01 keep
Created on 2024-04-03 with reprex v2.0.2
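The split-and-apply idea described above can also be sketched in base R with a single pass per id that remembers the last kept date. This is only a sketch under assumptions: keep_first_per_window() is a hypothetical helper name (it is not part of the answer), the toy data is made up, dates are assumed to be sorted within each id, and the number of cores is illustrative. parallel::mclapply() relies on forking, so on Windows it only runs with mc.cores = 1.

```r
library(parallel)

# Hypothetical helper: keep a row only if it falls more than `cutoff`
# days after the last KEPT row. Assumes x is already sorted by date.
keep_first_per_window <- function(x, cutoff = 30) {
  keep <- logical(nrow(x))
  last_kept <- as.Date(NA)
  for (i in seq_len(nrow(x))) {
    if (is.na(last_kept) || as.numeric(x$date[i] - last_kept) > cutoff) {
      keep[i] <- TRUE
      last_kept <- x$date[i]
    }
  }
  x[keep, , drop = FALSE]
}

# Small illustrative data set (made up for this sketch)
toy <- data.frame(
  id   = c("a", "a", "a", "b", "b"),
  date = as.Date(c("2022-01-02", "2022-01-20", "2022-02-03",
                   "2022-03-15", "2022-03-31"))
)

# Forking is unavailable on Windows, so fall back to one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply the helper to each id in parallel, then stack the results
res <- do.call(rbind, mclapply(split(toy, toy$id), keep_first_per_window,
                               mc.cores = n_cores))
rownames(res) <- NULL
res
```

On the toy data this keeps 2022-01-02 and 2022-02-03 for id a and 2022-03-15 for id b. Whether the parallel version is actually faster depends on how many ids there are and how large each group is, so it is worth timing against plain lapply() first.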
Something like this?
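df |>
  mutate(date = as.Date(date, format = '%m/%d/%y')) |>
  arrange(id, date) |>
  group_by(id) |>
  # keep the first row of each id (its lag is NA) and any row more than
  # `cutoff` days after the row immediately before it
  filter(is.na(lag(date)) | date - lag(date) > cutoff) |>
  ungroup()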
(Note that your sample data is not in chronological order.)
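One caveat with a lag()-based filter: it compares each date with the row immediately before it, whereas the question asks for the gap from the last kept row. The two can disagree when observations are spaced more closely than the cutoff. Here is a minimal base R sketch of the difference, using made-up dates 20 days apart and Reduce() with accumulate = TRUE to carry the last kept date forward (all variable names are illustrative):

```r
cutoff <- 30
d <- as.Date(c("2022-01-01", "2022-01-21", "2022-02-10", "2022-03-02"))

# Previous-row rule: keep the first date, then any date more than
# `cutoff` days after the row right before it
keep_prev <- c(TRUE, as.numeric(diff(d)) > cutoff)

# Last-kept rule: carry the current anchor (the last kept date) forward
# and start a new anchor once a date is more than `cutoff` days past it
anchor <- Reduce(function(a, x) if (x - a > cutoff) x else a,
                 as.numeric(d), accumulate = TRUE)
keep_last <- !duplicated(anchor)

data.frame(date = d, keep_prev, keep_last)
```

With these dates the previous-row rule keeps only 2022-01-01, because every consecutive gap is 20 days, while the last-kept rule also keeps 2022-02-10, which is 40 days after the first kept date. On the sample data in the question the distinction may not show up, so it is worth checking against data with closely spaced observations.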