How can I make this matching function faster in R? It currently takes 6-7 days, which is impractical

Question

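I have two data files to start with: a cohort of exposed individuals (100,000 rows) and a general-population cohort spanning a 5-year period (about 3 million rows). I am trying to write a matching function that, for each person in my exposed cohort, randomly selects 5 age- and sex-matched individuals from the general-population cohort. Sampling is with replacement across exposed individuals, so the same general-population individual can be selected as a match for more than one exposed person. The randomly selected individuals then populate a third data table. I previously tried the MatchIt package, but it could not do exactly what I need, which is why I am trying to write this code from scratch.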

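The code I wrote works, but because it is a for loop and my sample is very large, it takes far too long to be a practical solution. If you have any ideas for speeding it up, please help!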

My current code is below:

library(data.table)

find_matches <- function(exposed.cohort, unexposed.cohort) {
  # create an empty data.table to store the matches
  matched.data <- data.table()

  # iterate over each exposed row to find matches
  for (i in seq_len(nrow(exposed.cohort))) {
    exposed_person <- exposed.cohort[i]
    potential_matches <- unexposed.cohort[xxxxxxxx here is a long logical statement of which conditions the potential matches need to be met to be selected xxxxxx]

    # randomly sample 5 without replacement (keep all candidates if there are 5 or fewer)
    if (nrow(potential_matches) > 5) {
      matched_data <- potential_matches[sample(.N, 5)]
    } else {
      matched_data <- potential_matches
    }

    # add identifier of the exposed person these matches belong to
    matched_data[, matchID := exposed_person$ID]

    # store results (growing a table with rbind on every iteration is slow)
    matched.data <- rbind(matched.data, matched_data)
  }
  return(matched.data)
}

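The code above works, but I need to write it in a way that speeds up the process, if possible. Also, is there a way to see the matched.data output while the function is running, so that if R is interrupted, crashes, or I have to stop it early, the progress so far is still saved in matched.data in the global environment and I do not have to start over from scratch?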

Any help would be greatly appreciated. My deadline is tight and I am starting to get a little scared! Thank you.

Answer 1

This takes about one second on my machine for a problem of your size:

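set.seed(42)
n_e <- 1e5  # size of the simulated exposed cohort
n_g <- 3e6  # size of the simulated general-population cohort
exposed <- data.frame(id = 1:n_e,
                      age = sample(18:100, n_e, TRUE, prob = 100:18),
                      sex = sample(c("F","M"), n_e, TRUE))

genpop <- data.frame(id = 1:n_g,
                     some_value = sample(0:1000, size = n_g, TRUE),
                     age = sample(18:100, n_g, TRUE, prob = 100:18),
                     sex = sample(c("F","M"), n_g, TRUE))


library(tidyverse)  # join_by() with closest() requires dplyr >= 1.1.0

tictoc::tic()
matched <- exposed |>
  uncount(5) |>                        # five rows per exposed person, one per match
  mutate(match_val = runif(n())) |>    # random draw for each requested match
  left_join(
    genpop |> mutate(match_val = runif(n())),  # random draw for each candidate
    # same age and sex, then the candidate whose draw is closest at or below
    join_by(age, sex, closest(match_val >= match_val))
  )
tictoc::toc()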
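
For readers who would rather stay in data.table, as the question's code does, the same idea can be sketched as follows. This is only an illustration under stated assumptions, not the asker's actual matching rule: the question's full logical condition is not shown, so the sketch assumes an exact match on columns named `id`, `age`, and `sex` (the names used in the simulated data above). The helpers `match_controls()` and `match_in_chunks()`, and the `out_dir` and `chunk_size` arguments, are hypothetical names introduced here. The first helper groups the general-population row numbers by age/sex once, then draws up to 5 rows per exposed person without replacement within a person. The second processes the exposed cohort in chunks and saves each chunk to disk, which is one possible way to keep partial results if R is interrupted.

```r
library(data.table)

# Hypothetical helper: for each exposed person, draw up to `k` rows of `genpop`
# that share the same age and sex (assumed matching rule, see the text above).
match_controls <- function(exposed, genpop, k = 5) {
  exposed <- as.data.table(exposed)
  genpop  <- as.data.table(genpop)

  # Row numbers of genpop grouped once per stratum; list names look like "30.F".
  pool <- split(seq_len(nrow(genpop)), list(genpop$age, genpop$sex), drop = TRUE)
  key  <- paste(exposed$age, exposed$sex, sep = ".")

  # Sample positions (not values) so a single-candidate pool is handled safely.
  picks <- lapply(seq_len(nrow(exposed)), function(i) {
    cand <- pool[[key[i]]]                     # NULL when the stratum is empty
    if (length(cand) > k) cand[sample.int(length(cand), k)] else cand
  })

  matched <- genpop[as.integer(unlist(picks))]
  matched[, matchID := rep(exposed$id, times = lengths(picks))]
  matched[]
}

# Hypothetical wrapper: match in chunks and save each chunk as an .rds file,
# skipping chunks already on disk so that an interrupted run can be resumed.
match_in_chunks <- function(exposed, genpop, out_dir, chunk_size = 10000, k = 5) {
  exposed <- as.data.table(exposed)
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  for (s in seq(1, nrow(exposed), by = chunk_size)) {
    out_file <- file.path(out_dir, sprintf("matches_%09d.rds", s))
    if (file.exists(out_file)) next
    rows <- s:min(s + chunk_size - 1, nrow(exposed))
    saveRDS(match_controls(exposed[rows], genpop, k), out_file)
  }
  files <- list.files(out_dir, pattern = "^matches_.*\\.rds$", full.names = TRUE)
  rbindlist(lapply(files, readRDS))
}

# Small usage example on toy data (10 candidates in each age/sex stratum).
set.seed(1)
exposed <- data.table(id = 1:3, age = c(30, 30, 45), sex = c("F", "M", "F"))
genpop  <- data.table(id = 101:140,
                      age = rep(c(30, 45), each = 20),
                      sex = rep(c("F", "M"), 20))
matched <- match_controls(exposed, genpop)
```

Because the candidate pools are built once rather than re-filtered for every exposed person, and the result is assembled in a single subset instead of repeated `rbind()` calls, this should avoid the main costs of the original loop. A real matching rule with more conditions would need the grouping columns adjusted accordingly.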
