Bootstrapping in R - samples that span multiple rows

Question

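Using the example data frame pay, I am bootstrapping in base R. The main difference from a classic bootstrap is that one sampled unit can span multiple rows, all of which must be included. There are 4 unique IDs in pay, so my goal is to draw a sample of length 4 with replacement and build a new data set, resample, that contains every row of each sampled ID.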

My code currently works, but it is inefficient given that my data has 1 million rows and the bootstrap needs many repetitions.

Creating pay:

ID    <- c(1,1,1,2,3,3,4,4,4,4)
level <-  c(1:10)
pay <- data.frame(ID = ID,level =  level)

My (inefficient) code for creating a single resampled data set:

IDs <- levels(as.factor(ID))
samp <- sample(IDs, length(IDs), replace = TRUE)
resample <- numeric(0)

for (i in seq_along(samp)) {
  temp <- pay[pay$ID == samp[i], ]
  resample <- rbind(resample, temp)
}

Result:

 samp
[1] "1" "2" "3" "1"


 resample
  ID level
1  1   0.5
2  1  -2.0
3  1   3.0
4  2   4.0
5  3   5.0
6  3   6.0
7  1   0.5
8  1  -2.0
9  1   3.0

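I think the slowest part is growing resample on every iteration. However, I do not know in advance how many rows the result will have. Many thanks for your help.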

Answer 1

You can sample rows with

pay[sample(seq_len(nrow(pay)), replace=TRUE),]

which seems to be quite efficient.

> system.time({
+   for (i in 1:10000)
+     pay[sample(seq_len(nrow(pay)), replace=TRUE),]
+ })
   user  system elapsed
  0.469   0.002   0.473

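Edit:

As Dudelstein's comment below points out, the above is incorrect: it resamples individual rows rather than whole IDs. Here is an approach that solves what I believe you are asking for.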

samp <- sample(unique(ID), replace=TRUE)
do.call(rbind, lapply(samp, function(x) pay[pay$ID == x,]))

In my benchmarks it appears to be (roughly) a third faster than the original approach. I am sure there are better ways.
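One possible "better way", offered only as a sketch and not as part of the original answer: compute the row indices of each ID once with split(), then build the resample with a single subsetting call instead of one rbind per ID. The name rows_by_id is made up for illustration, and set.seed() is there only to make the sketch reproducible.

```r
pay <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4), level = 1:10)

# Row indices for each ID, computed once (rows_by_id is an illustrative name)
rows_by_id <- split(seq_len(nrow(pay)), pay$ID)

set.seed(42)  # only so the sketch is reproducible
# Sample IDs with replacement; the list names are the IDs as character strings
samp <- sample(names(rows_by_id), replace = TRUE)

# Look up the rows of every sampled ID and subset the data frame in one step
resample <- pay[unlist(rows_by_id[samp], use.names = FALSE), ]
```

Because rows_by_id does not depend on the draw, it can be reused across bootstrap repetitions, so only the sample() call and the final subsetting are repeated.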

Answer 2

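I recently had to do this myself with a large data frame, and I found @Josh's code so slow that it was completely impractical to use in a bootstrap.

Instead, I wrote the following code, which seems to reduce the computation time to something negligible: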

# Draw a sample of IDs from the data frame
# Length of sample is equal to the number of unique IDs in your data frame
samp <- sample(unique(df$id), length(unique(df$id)), replace = TRUE)

# Create a data frame tracking number of occurrences of IDs in the sample
df_table <- as.data.frame(table(samp))
# Convert the factor back to the original IDs (this assumes numeric IDs)
df_table$samp <- as.numeric(levels(df_table$samp))[df_table$samp]

# Initialize some variables for the loop that creates the bootstrap data frame
a <- 1
df_boot <- data.frame()

# Pass a appends every ID that was drawn at least a times
while (a <= max(df_table$Freq)) {
  id_boot <- df_table[df_table$Freq >= a, 1]
  df_boot <- rbind(df_boot, df[df$id %in% id_boot, ])
  a <- a + 1
}

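The trick here is that we pull the rows for many IDs out of the data frame at once, rather than telling R to scan the entire data frame separately for each ID in the sample, which is what @Josh's code does. If your data frame has 5,000 unique IDs and 10,000 rows of data, that approach ends up telling R to search 5,000 x 10,000 = 50,000,000 rows, so you can see why it might take a long time to finish. That is impractical for a bootstrap, which usually requires you to repeat the code thousands of times.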

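Instead, by using df[df$id %in% id_boot, ], each pass is a single vectorized membership test over the id column, and the loop runs only as many times as the highest number of times any one ID was drawn, not once per sampled ID. Far less work is wasted on rows that do not match what we are looking for.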
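To make the passes concrete, here is a small worked sketch on the pay data from the question, using the sample shown there (IDs 1, 2, 3, 1). The fixed samp and the names freq, ids, passes and boot are assumptions for illustration, not part of the original code.

```r
pay <- data.frame(ID = c(1, 1, 1, 2, 3, 3, 4, 4, 4, 4), level = 1:10)

samp <- c(1, 2, 3, 1)             # fixed sample, for illustration only
freq <- table(samp)               # ID 1 drawn twice, IDs 2 and 3 once
ids  <- as.numeric(names(freq))   # assumes numeric IDs

# Pass a collects every ID that was drawn at least a times
passes <- lapply(seq_len(max(freq)), function(a) ids[as.vector(freq) >= a])
# passes[[1]] holds IDs 1, 2, 3; passes[[2]] holds ID 1 only

# One vectorized %in% lookup per pass, then a single rbind over the passes
boot <- do.call(rbind, lapply(passes, function(p) pay[pay$ID %in% p, ]))
```

Only two lookups are needed here even though four IDs were sampled. The rows come out grouped by pass rather than in the order the IDs were drawn, which does not matter for statistics that ignore row order.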

I was able to run this code on a data frame with 10,000 rows and have it finish in about 1-2 seconds, whereas @Josh's code took nearly a minute.
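Timings like these depend heavily on the data and the machine, so it is worth checking on your own data. The following is a rough sketch of such a comparison on synthetic data; the data set, the function names per_id and by_freq, and the sizes are all assumptions for illustration.

```r
set.seed(1)
# Synthetic data (an assumption): 500 IDs with 1 to 5 rows each
df <- data.frame(id = rep(seq_len(500), times = sample(1:5, 500, replace = TRUE)))
df$value <- rnorm(nrow(df))

samp <- sample(unique(df$id), replace = TRUE)

# One lookup and one data frame per sampled ID
per_id <- function(df, samp) {
  do.call(rbind, lapply(samp, function(x) df[df$id == x, ]))
}

# One lookup per pass, where pass a takes every ID drawn at least a times
by_freq <- function(df, samp) {
  freq <- table(samp)
  ids <- as.numeric(names(freq))  # assumes numeric IDs
  do.call(rbind, lapply(seq_len(max(freq)), function(a) {
    df[df$id %in% ids[as.vector(freq) >= a], ]
  }))
}

t_per_id  <- system.time(res_a <- per_id(df, samp))
t_by_freq <- system.time(res_b <- by_freq(df, samp))
```

Printing t_per_id and t_by_freq shows the elapsed times on your machine. Both results contain the same rows, although in a different order.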
