Code to include/exclude patients in an R dataset


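In my R dataset, I want each patient to appear at most once within any 30-day period. In the raw data a patient can appear several times within a week, a month, or a year; I just don't want the same patient to appear twice within 30 days. My dataset is very large, so it is hard to check whether my code is correct.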

r filter dataset
1 Answer

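Here is a data.table implementation that identifies and removes records with the same ID that fall within 30 days of an earlier kept record. For each ID, the first record is kept, every record within 30 days of it is dropped, the next record more than 30 days later is kept, and so on.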

library(data.table)
library(magrittr)
library(lubridate)
library(collapse)

# Randomly generate 1,000 people with 5% chance of appointments on a given day
N <- 1000
p <- 0.05

set.seed(6453)
dt <- lapply(seq(ymd('2024-01-01'), ymd('2024-12-31'), by = '1 day'),
       function (x) data.table(id = (1:N)[runif(N) < p], day = x)) %>%
  rbindlist()
setorder(dt, id, day)

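# Days since the same patient's previous record (collapse::L lags within id)
dt[, day_diff := as.numeric(day - L(day, g = id))]
# The first record of each patient has no predecessor, so treat its gap as 0
dt[is.na(day_diff), day_diff := 0]
# Days elapsed since the patient's first record
dt[, cum_diff := cumsum(day_diff), by = .(id)]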

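# Records within 30 days of the current anchor (cum_diff == 0) are left alone.
# The first record more than 30 days out becomes the next anchor, and later
# records are re-measured from it. Repeat until no cum_diff exceeds 30.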
while(max(dt$cum_diff) > 30){
  dt[cum_diff > 30, cum_diff := cum_diff - min(cum_diff), by = .(id)]
}
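# Worked trace of the loop above (illustrative, hypothetical numbers, not taken
# from the simulated data): one patient with records 0, 10, 35, 40 and 70 days
# after the first one.
#   start:  cum_diff = 0, 10, 35, 40, 70
#   pass 1: rows > 30 are 35, 40, 70; subtract their minimum (35) -> 0, 5, 35
#           cum_diff = 0, 10,  0,  5, 35
#   pass 2: the only row > 30 is 35; subtract 35 -> 0
#           cum_diff = 0, 10,  0,  5,  0
# The rows left at 0 (days 0, 35 and 70) are the anchors, each more than
# 30 days after the previous one.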
# Keep the anchors (cum_diff == 0); drop the rest, which fall within 30 days of an anchor.
keep <- dt[cum_diff == 0]

# Check it meets requirements
setorder(keep, id, day)
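# Recompute the gap between consecutive kept records of each patient
keep[, day_diff := as.numeric(day - L(day, g = id))]
keep[, cum_diff := NULL]
# Should be 0; the NA gap on each patient's first record is not counted
nrow(keep[day_diff <= 30]) # = 0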
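
Since the question mentions that the result is hard to verify on a large dataset, one option is to cross-check against a slower but more obvious version of the same rule. The following is a minimal base R sketch, not part of the answer's method: `keep_greedy` is a hypothetical helper name, and it assumes the dates passed in belong to a single patient and are already sorted in ascending order.

```r
# Hypothetical helper: greedy per-patient filter written as a plain loop.
# `days` must be one patient's dates, sorted ascending.
# Returns a logical vector marking the records to keep.
keep_greedy <- function(days, window = 30) {
  days <- as.numeric(days)          # Dates become day counts
  kept <- logical(length(days))
  last_kept <- -Inf                 # so the first record is always kept
  for (i in seq_along(days)) {
    # Keep a record only if it is more than `window` days after the last kept one
    if (days[i] - last_kept > window) {
      kept[i] <- TRUE
      last_kept <- days[i]
    }
  }
  kept
}

# Toy example: one patient, five records (made-up dates)
visits <- as.Date(c("2024-01-01", "2024-01-11", "2024-02-05",
                    "2024-02-10", "2024-03-11"))
kept_visits <- visits[keep_greedy(visits)]
print(kept_visits)
# [1] "2024-01-01" "2024-02-05" "2024-03-11"
```

Applied per patient to the sorted table, for example with something like `dt[, .(day = day[keep_greedy(day)]), by = id]`, this should give the same id/day pairs as `keep`. Running the comparison on a random sample of IDs keeps it cheap on a large dataset.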