Creating matched pairs in R

Question

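I have a dataset of about 20,000 cases, each with 3 potential controls. Cases and controls are each uniquely identified by an ID variable. There is some overlap among the potential controls because of how they were joined to the cases in SQL, which does not allow matching without replacement. I have imported the data into R and want to treat it as a set of ~20,000 cases and ~50,000 controls, so that I can select only 1 control for each case, matched on covariates in the dataset (e.g., age). My desired output has the case ID in column 1 and the matched control ID in column 2.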

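I have been trying to use the MatchIt package for the matching, but the package's output (match.matrix) is a list of IDs that do not correspond exactly to either cases or controls. The package has a function called get_matches that looks like it would return the appropriate output, but its arguments are opaque to me: I can't figure out what id_cols and getdata are. There doesn't seem to be any tutorial on how to use MatchIt (or another package) to simply return a list of case IDs with their matched control IDs. I am using Mahalanobis distance, but I don't care about the actual distance measure or about returning propensity scores. What is the best package and method for selecting the single control that best matches each case, without replacement, and returning each case ID with its matched control ID?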

Sample of the imported data (note the overlap among some of the potential matches):

case_ID <- c(1,1,1,2,2,2,3,3,3,4,4,4)
control_ID <- c(5,6,7,8,9,10,5,6,7,11,12,13)
age <- c(12,12,12,56,56,56,12,12,12,62,62,62)
score <- c(7,7,7,3,3,3,7,7,7,9,9,9)
parity <- c(1,1,1,4,4,4,1,1,1,2,2,2)
retested <- c(1,1,1,0,0,0,1,1,1,1,1,1)

df <- cbind(case_ID, control_ID, age, score, parity, retested)

Desired output (covariates shown):

matched_case <- c(1,2,3,4)
matched_control <- c(5,8,6,11)
matched_age <- c(12,56,12,62)
matched_score <- c(7,3,7,9)
matched_parity <- c(1,4,1,2)
matched_retested <- c(1,0,1,1)

matched_df <- cbind(matched_case, matched_control, matched_age, matched_score, matched_parity, matched_retested)

Answer 1

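I'm not 100% sure, but this code works at least for your example. From your comment I conclude that the sample doesn't tell the whole story. It will also be very slow, but it might be a start.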

case_ID     <- c( 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
control_ID  <- c( 5, 6, 7, 8, 9,10, 5, 6, 7,11,12,13)
age         <- c(12,12,12,56,56,56,12,12,12,62,62,62)
score       <- c( 7, 7, 7, 3, 3, 3, 7, 7, 7, 9, 9, 9)
parity      <- c( 1, 1, 1, 4, 4, 4, 1, 1, 1, 2, 2, 2)
retested    <- c( 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1)

df <- data.frame(case_ID, control_ID, age, score, parity, retested)

# Key identifying each control row (control ID plus its covariates);
# the "_" separator avoids accidental collisions between pasted values
df$unq <- apply(df[, 2:6], 1, paste, collapse = "_")
r <- 1
while (r < nrow(df)) {
  # Accept row `r`; drop later rows that reuse the same control
  later <- seq_len(nrow(df)) > r
  df <- df[!(later & df$unq == df$unq[r]), ]
  # Drop the remaining later rows that belong to the same case
  later <- seq_len(nrow(df)) > r
  df <- df[!(later & df$case_ID == df$case_ID[r]), ]
  # Next!
  r <- r + 1
}
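
The same greedy idea can probably be written so that it does not rebuild the data frame on every step. The following is only a sketch under stated assumptions, not a tested solution for the full data: it assumes control_ID alone identifies a control, that each case's candidate controls are already listed in order of preference, and that taking the first unused candidate is acceptable. The helper name greedy_match and the data frame name pairs are made up for illustration.

```r
# Hypothetical helper: greedy 1:1 matching without replacement.
# Assumes `pairs` has columns case_ID and control_ID, with each case's
# candidate controls listed in order of preference.
greedy_match <- function(pairs) {
  cases <- unique(pairs$case_ID)
  # One vector of candidate controls per case, keyed by case ID
  candidates <- split(pairs$control_ID, pairs$case_ID)
  matched <- rep(NA_real_, length(cases))
  used <- numeric(0)
  for (i in seq_along(cases)) {
    # Candidates for this case that no earlier case has taken
    free <- setdiff(candidates[[as.character(cases[i])]], used)
    if (length(free) > 0) {
      matched[i] <- free[1]
      used <- c(used, free[1])
    }
  }
  # Cases whose candidates were all taken keep NA as their control
  data.frame(case_ID = cases, control_ID = matched)
}

pairs <- data.frame(
  case_ID    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
  control_ID = c(5, 6, 7, 8, 9, 10, 5, 6, 7, 11, 12, 13)
)
result <- greedy_match(pairs)
print(result)
```

On the sample data this pairs cases 1, 2, 3, 4 with controls 5, 8, 6, 11, the same as the desired output. The covariates could then be attached with merge() on case_ID and control_ID. Like the loop above, this is order-dependent and does not look for a globally best assignment.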
