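I am new to R. I am used to VB, which relies heavily on loops, but I understand that R is more efficient when the work can be vectorized. I am not sure whether what I have built here can be vectorized.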
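The general idea is, for n = 3:N:
1. Draw a random sample of size n (without replacement) from the full data set of size N.
2. Resample it B times.
3. Repeat X times.
4. Average the X parameter estimates and check for convergence (i.e. look at the standard deviation between estimates or something similar, TBD).
I have not implemented step 4 yet, so the code below only performs steps 1:3. Step 4 should be simple enough to do outside the loop with rowMeans().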
Note: I set B and X to 100 for testing, but for final use both need to be 10,000 (or more).
# simulate observation of N=30
bdf <- data.frame(sample(8:13, 30, replace = TRUE))
# get number of observations (nrow, because length() of a data frame counts columns)
N <- nrow(bdf)
# set number of bootstrap replicates
B <- 100
# set number of times to repeat the estimate
X <- 100
# create empty storage container for results
result_vec <- numeric(B)
# this loop iterates over the number of times to repeat the estimate
for (j in 1:X) {
  # this loop iterates sample size from n=3 to n=N
  for (i in 3:N) {
    # random sample of size n, taken from the rows of the single data column
    boot_samp <- bdf[sample(N, size = i, replace = FALSE), 1]
    # this loop does the bootstrap sampling
    for (b in 1:B) {
      # draw a bootstrap sample
      bsamp <- sample(boot_samp, size = i, replace = TRUE)
      # calculate your parameter
      p <- mean(bsamp)
      #p <- sd(bsamp)
      # save the calculated parameter
      result_vec[b] <- p
    }
    if (i == 3) {
      # initiate data frame and store the results for n=3 parameter estimate
      df_res <- data.frame(result_vec)
    } else {
      # add the results for n=i parameter estimate to the data frame
      df_temp <- data.frame(result_vec)
      df_res <- cbind(df_res, df_temp)
    }
    # rename the column in the data frame as n=i
    names(df_res)[ncol(df_res)] <- paste("n =", i)
  }
  # calculate the mean of the parameter estimates
  allmeans <- colMeans(df_res)
  if (j == 1) {
    # initiate a new data frame to store the means
    df_means <- data.frame(allmeans)
  } else {
    # add the results to the existing data frame
    df_temp <- data.frame(allmeans)
    df_means <- cbind(df_means, df_temp)
  }
  # rename the column in the data frame with j
  names(df_means)[ncol(df_means)] <- j
}
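All of the loops can be replaced with a couple of mapply calls. Because the innermost loop samples with replacement, all B resamples can be drawn at once, placed in a matrix, and averaged with rowMeans.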
# simulate observation of N=30
bdf <- data.frame(sample(8:13, 30, replace = TRUE))
# get number of observations
N <- nrow(bdf)
# set number of bootstrap replicates
B <- 1e4
# set number of times to repeat the estimate
X <- 100
# the outer mapply iterates over the number of times to repeat the estimate
system.time({
  df_means <- mapply(
    \(j) colMeans(
      mapply(
        \(i) rowMeans(matrix(sample(sample(bdf[,1], i), B*i, TRUE), B, i)), 3:N
      )
    ), 1:X
  )
  dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
})
#>    user  system elapsed
#>   21.58    1.39   22.98
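To unpack the inner expression for a single sample size, here is a minimal sketch. The names `x`, `n`, `sub`, `draws`, and `boot_means` are illustrative and not part of the code above, and `B` is kept small so the snippet runs instantly.

```r
set.seed(1)                               # only to make the sketch repeatable
x <- sample(8:13, 30, replace = TRUE)     # stand-in for bdf[, 1]
n <- 5                                    # one sample size taken from 3:N
B <- 1000                                 # bootstrap replicates (small for the demo)

# step 1: subsample of size n without replacement
sub <- sample(x, n)

# step 2: all B bootstrap resamples in one call, one resample per matrix row
draws <- matrix(sample(sub, B * n, replace = TRUE), nrow = B, ncol = n)

# one bootstrap mean per row; this replaces the for (b in 1:B) loop
boot_means <- rowMeans(draws)
```

Because every draw is made independently from the same subsample, it does not matter how the `B * n` values are arranged into rows: each row is a valid bootstrap resample of `sub`.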
This process is also very easy to run in parallel:
library(parallel)
X <- 1e3
system.time({
  cl <- makeCluster(detectCores() - 1) # 15 cores
  # make the data and settings visible to the worker processes
  clusterExport(cl, c("bdf", "N", "B"))
  df_means <- simplify2array(
    parLapply(cl, 1:X, \(j) colMeans(
      mapply(
        \(i) rowMeans(matrix(sample(sample(bdf[,1], i), B*i, TRUE), B, i)), 3:N
      )
    ))
  )
  dimnames(df_means) <- list(paste0("n", 3:N), paste0("j", 1:X))
  # release the worker processes once the results are collected
  stopCluster(cl)
})
#>    user  system elapsed
#>    0.02    0.14   22.08
Running B = X = 10000 in parallel takes just under 4 minutes on my aging laptop.
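Step 4 from the question can then run outside all of the loops. The following is a minimal sketch, assuming a matrix shaped like `df_means` with one row per sample size and one column per repetition. The random matrix `demo_means` is only a stand-in so the snippet runs on its own, and `n_sizes`, `reps`, and `summary_df` are hypothetical names. Using the standard deviation across repetitions as the convergence measure is one option among several, since the question leaves the criterion open.

```r
set.seed(1)
n_sizes <- 3:30    # sample sizes, one per row
reps <- 100        # number of repetitions, one per column

# stand-in for df_means: made-up values, NOT real bootstrap output
demo_means <- matrix(
  rnorm(length(n_sizes) * reps, mean = 10.5),
  nrow = length(n_sizes),
  dimnames = list(paste0("n", n_sizes), paste0("j", seq_len(reps)))
)

summary_df <- data.frame(
  n    = n_sizes,
  mean = rowMeans(demo_means),        # average of the estimates per sample size
  sd   = apply(demo_means, 1, sd)     # spread of the estimates across repetitions
)
head(summary_df)
```

With the real `df_means`, plotting `sd` against `n` would be one way to see at which sample size the estimates stop changing much.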