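I am currently working with a large medical dataset that stores multiple measurements (up to 6) of different variables. I reshaped it into a long table because I want to calculate correlations between these variables. However, I have run into three problems: (A) not all participants have the same number of measurements (some have 6, some have 1); (B) a given participant does not necessarily have the same number of measurements for each variable (VarA might have 3 measurements while VarB might have 1); and (C), the last and biggest problem, the variables are time-sensitive, so only measurements recorded within x time of each other can be considered usable together.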
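The difficulty is that VarA measurement 1 may line up with VarB measurement 2 or even 3. So, for each participant, I need to check the measurement dates, work out how far apart they are, and reorganize the dataset accordingly.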
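To make the matching rule concrete, here is a minimal base R sketch of the check for a single participant. The helper name `nearest_within` and the 45-day limit are assumptions for illustration only (45 days is the threshold used in the loop further down); it pairs each VarA date with the closest VarB date and returns NA when nothing falls inside the window.

```r
# Hypothetical helper (not part of the dataset or any package):
# for each date in `a`, return the index of the closest date in `b`
# that lies within `max_gap_days`, or NA if there is none.
nearest_within <- function(a, b, max_gap_days) {
  vapply(seq_along(a), function(i) {
    gap <- abs(as.numeric(b - a[i]))  # Date - Date is a difftime in days
    if (is.na(a[i]) || all(is.na(gap))) return(NA_integer_)
    j <- which.min(gap)               # which.min() skips NA gaps
    if (gap[j] <= max_gap_days) j else NA_integer_
  }, integer(1))
}

# Three VarA dates and two VarB dates for one participant
varA_date <- as.Date(c(16576, 16942, 17308), origin = "1970-01-01")
varB_date <- as.Date(c(16942, 17674), origin = "1970-01-01")

# Only the second VarA date has a VarB date within 45 days
matches <- nearest_within(varA_date, varB_date, max_gap_days = 45)
```

Here `matches` holds, for every VarA measurement, the position of the VarB measurement it could be paired with, so VarA measurement 2 pairs with VarB measurement 1 and the other two stay unmatched.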
I want to go from this:
to this:
This is what I have tried:
# initialization
time_period <- 0
time_list <- vector("numeric", length = nrow(dates_long))
ref_date <- as.Date(dates_long$date_[1], "%d-%m-%Y")
for (i in seq_len(nrow(dates_long))) {
  # Reset the reference date and the counter at the start of each participant
  if (i == 1 || dates_long$Id[i] != dates_long$Id[i - 1]) {
    ref_date <- as.Date(dates_long$date_[i], "%d-%m-%Y")
    time_period <- 0
  }
  date <- as.Date(dates_long$date_[i], "%d-%m-%Y")
  # Check for missing values in date
  if (!is.na(date)) {
    # Use the first valid date as reference if the participant started with NA
    if (is.na(ref_date)) ref_date <- date
    if (as.numeric(date - ref_date, units = "days") <= 45) {
      time_list[i] <- time_period
    } else {
      # More than 45 days after the reference date: start a new period
      time_period <- time_period + 1
      time_list[i] <- time_period
      ref_date <- date
    }
  } else {
    # Handle missing date values, for example, set time_list[i] to a specific value
    time_list[i] <- NA
  }
}
dates_long$time_list <- time_list
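With this code I try to assign each matching pair, per ID, a value in time_list. The first match is time 1, the second match is 2, and so on. I was hoping I could then rearrange the data by this column.
Unfortunately, this does not work. I hope one of you clever people knows the answer.
Kind regards
Edit: output of an example data frame: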
dates_long <- structure(list(Participant.ld = c(1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), varA_date = structure(c(16576,
16942, 17308, 17674, 18040, 18407, 16160, 16912, 17308, NA, NA,
NA, 16576, 16942, 17308, 17674, 18040, 18407), class = "Date"),
varA_value = c(10L, 20L, 30L, 40L, 50L, 60L, 11L, 22L, 33L,
NA, NA, NA, NA, 44L, NA, NA, NA, NA), varB_date = structure(c(16942,
17674, 18040, NA, NA, NA, 16942, 17308, NA, NA, NA, NA, 16952,
NA, NA, NA, NA, NA), class = "Date"), varB_value = c(100L,
200L, 300L, 400L, 500L, 600L, 111L, 222L, NA, NA, NA, NA,
555L, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-18L))
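An attempt. I'm not sure this is exactly what you're after, but in the absence of a fuller question and a reproducible expected answer, here's my go: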
library(tidyverse)
library(lubridate)  # provides days(); only attached by library(tidyverse) from tidyverse 2.0.0 on
# first create a truer long-form dataset:
dates_longer <- dates_long |>
  pivot_longer(cols = -c(Participant.ld),
               names_to = c("Variable", ".value"),
               names_sep = "_")
# next, establish our tolerance:
tolerance <- days(31)
# now do the main work:
dates_longer %>%
  group_by(Participant.ld) %>%
  arrange(date, .by_group = TRUE) %>%
  ungroup() %>%
  mutate(datelag = date - lag(date, 1, default = date[1] - (tolerance + days(1))),
         # gives us a rolling "days since last entry"
         cluster = cumsum(datelag > tolerance | is.na(datelag))) %>%
  # starts a new cluster every time the gap is either greater than the tolerance
  # or if there's an NA date
  pivot_wider(names_from = Variable,
              names_glue = "{Variable}_{.value}",
              values_from = c(value, date)) %>%
  # now pivoted back to the original format
  group_by(Participant.ld, cluster) %>%
  fill(starts_with("var"), .direction = "downup") %>%
  slice_head(n = 1) %>%
  # cuts the set to one consolidated row for each cluster
  select(-c(datelag, cluster))
This gives:
# Groups: Participant.ld, cluster [30]
cluster Participant.ld varA_value varB_value varA_date varB_date
<int> <int> <int> <int> <date> <date>
1 1 1 10 NA 2015-05-21 NA
2 2 1 20 100 2016-05-21 2016-05-21
3 3 1 30 NA 2017-05-22 NA
4 4 1 40 200 2018-05-23 2018-05-23
5 5 1 50 300 2019-05-24 2019-05-24
6 6 1 60 NA 2020-05-25 NA
7 7 1 NA 400 NA NA
8 8 1 NA 500 NA NA
9 9 1 NA 600 NA NA
10 10 2 11 NA 2014-03-31 NA
# … with 20 more rows
# ℹ Use `print(n = ...)` to see more rows
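The clustering step in the pipeline above may be easier to follow in isolation. The following is a minimal base R sketch of the same idea, not a replacement for the pipeline: the four dates are made up for illustration, and the 31-day tolerance is written as a plain number instead of a lubridate period.

```r
# Made-up, already sorted dates for one participant (illustration only)
dates <- as.Date(c("2016-05-21", "2016-05-31", "2017-05-22", NA))
tolerance_days <- 31

# Days since the previous entry; Inf forces the first row to open a cluster
gap_days <- c(Inf, as.numeric(diff(dates)))

# A new cluster starts whenever the gap exceeds the tolerance or is NA,
# and cumsum() turns those "start" flags into running cluster ids
cluster <- cumsum(gap_days > tolerance_days | is.na(gap_days))
```

The first two dates are 10 days apart and share cluster 1, the third date is almost a year later and opens cluster 2, and the missing date gets a cluster of its own. That last point is why rows with NA dates each appear as separate rows in the output above.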