Split a data frame in R into multiple rows per ID based on a date column

Question

I have a data frame in R that looks like this:

library(dplyr)
library(lubridate)
library(tidyr)

set.seed(123)

# Create the dataset
num_rows <- 1000
fixed_start_date <- as.Date("2021-12-27")
end_date <- as.Date("2022-10-15")
p_na <- 0.6

# Generate random dates for end column
end_dates <- sample(seq(fixed_start_date, end_date, by = "days"), num_rows, replace = TRUE)

# Generate random dates for the test column
test_dates <- ifelse(runif(num_rows) > p_na,
                     sample(seq(fixed_start_date, end_date, by = "days"), num_rows, replace = TRUE),
                     NA)

# Convert test_dates to actual Date objects
test_dates <- as.Date(test_dates, origin = "1970-01-01")

# Create the data frame
data <- data.frame(
  ID = 1:num_rows,
  start_date = rep(fixed_start_date, num_rows),
  end_date = end_dates,
  test_date = test_dates
)

# Display the first few rows of the dataset
head(data)

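What I want to do is split each ID into several rows whenever it has a date in the test_date column.

For example, ID 1 should be split into two rows.

The first row has:

ID unchanged (i.e. 1),
start_date == "2021-12-27",
end_date == "2022-02-23", which is the day before test_date in this case,
test_date unchanged, i.e. 2022-02-24 in this case,
a new column called exposure, set to 0.

The second row has:

ID unchanged (i.e. 1),
start_date == "2022-02-24", which is now the test_date value,
end_date unchanged (i.e. "2022-06-23"),
test_date unchanged, i.e. 2022-02-24 in this case,
a new column called exposure, set to 1.

For rows where test_date is NA, the ID should keep a single row with all columns identical to the original data, plus the new exposure column set to 0.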

Answer 1

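Each ID is unique and there are only two exposure states, so there is no need to worry about grouping. Instead, split the dataset into is.na(test_date) and !is.na(test_date), duplicate the rows of the latter group, and then bind everything back together, which is simpler and faster. Here is a data.table approach: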

library(data.table)
dat <- as.data.table(data)

# Create exposure group
exposed <- rbindlist(list(
    dat[!is.na(test_date)][, `:=`(
        test_date = test_date - 1,
        exposure = 0
    )],
    dat[!is.na(test_date)][, `:=`(
        exposure = 1
    )]
))

full_dat  <- rbindlist(list(
    exposed,
    dat[is.na(test_date)][, exposure := 0]
))  |> setorder(ID, test_date)

head(full_dat, 10)

#        ID start_date   end_date  test_date exposure
#     <int>     <Date>     <Date>     <Date>    <num>
#  1:     1 2021-12-27 2022-06-23 2022-02-23        0
#  2:     1 2021-12-27 2022-06-23 2022-02-24        1
#  3:     2 2021-12-27 2022-01-09 2022-08-23        0
#  4:     2 2021-12-27 2022-01-09 2022-08-24        1
#  5:     3 2021-12-27 2022-07-09 2022-09-02        0
#  6:     3 2021-12-27 2022-07-09 2022-09-03        1
#  7:     4 2021-12-27 2022-04-23       <NA>        0
#  8:     5 2021-12-27 2022-08-12 2022-09-12        0
#  9:     5 2021-12-27 2022-08-12 2022-09-13        1
# 10:     6 2021-12-27 2022-08-27 2022-01-06        0

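Alternatively, if you want to stay in the tidyverse, here is a dplyr approach. I am still calling your data dat, because data() is a function in the utils package, which is loaded by default at startup: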

bind_rows(
    filter(dat, !is.na(test_date)) |>
        mutate(
            test_date = test_date - 1,
            exposure = 0
        ),
    filter(dat, !is.na(test_date)) |>
        mutate(exposure = 1)
)  |>
 bind_rows(
    filter(dat, is.na(test_date)) |>
        mutate(exposure = 0)
) |> arrange(ID, test_date)

This gives the same output.
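Both snippets above shift test_date on the unexposed row, whereas the question asks for test_date to stay fixed while end_date (first row) and start_date (second row) move. A minimal sketch of that variant follows, using dplyr and a small made-up data frame (toy, tested and split_toy are hypothetical names, and the dates are illustrative only). It follows the question's rules literally and does not handle a test_date that falls outside the start_date to end_date range, which the randomly generated data can contain.

```r
library(dplyr)

# Toy data (hypothetical values) with the same columns as the question's data
toy <- data.frame(
  ID         = 1:3,
  start_date = as.Date("2021-12-27"),
  end_date   = as.Date(c("2022-06-23", "2022-04-23", "2022-08-12")),
  test_date  = as.Date(c("2022-02-24", NA, "2022-03-15"))
)

# IDs that have a test date and therefore need two rows
tested <- filter(toy, !is.na(test_date))

split_toy <- bind_rows(
  # Unexposed period: ends the day before the test, test_date untouched
  mutate(tested, end_date = test_date - 1, exposure = 0),
  # Exposed period: starts on the test date, test_date untouched
  mutate(tested, start_date = test_date, exposure = 1),
  # IDs without a test keep their single original row
  filter(toy, is.na(test_date)) |> mutate(exposure = 0)
) |>
  arrange(ID, exposure)

split_toy
```

Sorting by ID and exposure (rather than test_date) keeps the unexposed row first, since both rows of an ID now share the same test_date. On the toy data, ID 1 ends up with one row running from 2021-12-27 to 2022-02-23 with exposure 0 and one from 2022-02-24 to 2022-06-23 with exposure 1, matching the example in the question.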
