I have a working function coded to (hopefully) take advantage of parallel processing. I am not yet very proficient in R, especially with functions and iteration. I'm hoping someone can help me optimize the function I've written, along with the additional code, to reduce computation time and make full use of the parallel processing options: specifically, using %dopar% instead of %do%, and moving the additional code and the parallel-processing setup inside the function. I can't seem to get %dopar% to work, and I'm not sure whether the problem is my code, my R version, or conflicting libraries.
I would greatly appreciate any suggestions on possible ways to get the same results more efficiently.
Background:
I am building gbm models with dismo::gbm.step. gbm.step selects the optimal number of trees via k-fold cross-validation, but the tree complexity and learning rate parameters still need to be set. I know caret::train was built specifically for this task, and I've had a lot of fun learning caret, especially its adaptive resampling. However, my response is binomial, and caret has no option to return AUC for a binomial distribution; I want to use AUC to replicate similar published studies in my field (ecology).
In a later analysis I also use dismo::gbm.simplify to identify possible reduced models. gbm.simplify relies on data created when a model is built in dismo, and I couldn't get it to work on models built in caret.
Finally, most of the gbm literature in ecology follows the methods described in Elith et al. 2008, "A working guide to boosted regression trees", which underlies the BRT functions in dismo. For the purposes of this study, I want to continue building my gbm models with dismo.
The function I wrote tests many combinations of tree.complexity and learning.rate and returns a list of several performance metrics for each model. I then combine all the lists into a data.frame for easy sorting.
Function goals:
- Create a gbm model for each combination of tree.complexity and learning.rate.
- Store $self.statistics$discrimination, cv.statistics$discrimination.mean, self.statistics$mean.resid, and cv.statistics$deviance.mean in a list for each gbm model created.
- Drop each gbm model itself to save space.
Reproducible example, using the Anguilla_train dataset from the dismo package:
#Load libraries
require(pacman)
p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table)
data(Anguilla_train)
#Identify cores on current system
cores<-detectCores(all.tests = FALSE, logical = FALSE)
cores
#Create training function for gbm.step
step.train.fx=function(tree.com,learn){
#set seed for reproducibility
char2seed("StackOverflow", set = TRUE)
k1<-gbm.step(data=Anguilla_train,
gbm.x = 3:13,
gbm.y = 2,
family = "bernoulli",
tree.complexity = tree.com,
learning.rate = learn,
bag.fraction = 0.7,
prev.stratify=TRUE,
n.folds=10,
n.trees=700,
step.size=25,
silent=TRUE,
plot.main = FALSE,
n.cores=cores)
k.out=list(interaction.depth=k1$interaction.depth,
shrinkage=k1$shrinkage,
n.trees=k1$n.trees,
AUC=k1$self.statistics$discrimination,
cv.AUC=k1$cv.statistics$discrimination.mean,
deviance=k1$self.statistics$mean.resid,
cv.deviance=k1$cv.statistics$deviance.mean)
return(k.out)
}
#define complexity and learning rate
tree.complexity<-c(1:5)
learning.rate<-c(0.01,0.025,0.005,0.0025,0.001)
#setup parallel backend to use n processors
cl<-makeCluster(cores)
registerDoParallel(cl)
#Run the actual function
foreach(i = tree.complexity) %do% {
foreach(j = learning.rate) %do% {
nam=paste0("gbm_tc",i,"lr",j)
assign(nam,step.train.fx(tree.com=i,learn=j))
}
}
#Stop parallel
stopCluster(cl)
registerDoSEQ()
#disable scientific notation
options(scipen=999)
#Find all item in workspace that contain "gbm_tc"
train.all<-ls(pattern="gbm_tc")
#cbind each list that contains "gbm_tc"
train.results<-list(do.call(cbind,mget(train.all)))
#Place in a data frame
train.results<- do.call(rbind, lapply(train.results, rbind))
train.results <- data.frame(matrix(unlist(train.results),ncol=7 , byrow=T))
#Change column names
colnames(train.results)<-c("TC","LR","n.trees", "AUC", "cv.AUC", "dev", "cv.dev")
#Round 4:7
train.results[,4:7]<-round(train.results[,4:7],digits=3)
#Sort by cv.dev, cv.AUC, AUC
train.results<-train.results[order(train.results$cv.dev,-train.results$cv.AUC, -train.results$AUC),]
train.results
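As an aside on the combining step: the cbind/rbind/unlist chain above can usually be collapsed into a single do.call over the lists. A minimal base-R sketch, with made-up toy values standing in for the output of step.train.fx:

```r
# Toy stand-ins for two of the per-model lists returned by step.train.fx
# (values invented for illustration only)
res1 <- list(interaction.depth = 1, shrinkage = 0.010, n.trees = 700,
             AUC = 0.85, cv.AUC = 0.80, deviance = 0.50, cv.deviance = 0.60)
res2 <- list(interaction.depth = 2, shrinkage = 0.005, n.trees = 675,
             AUC = 0.87, cv.AUC = 0.82, deviance = 0.45, cv.deviance = 0.55)

# One call turns the lists into a sortable data frame
train.results <- do.call(rbind, lapply(list(res1, res2), as.data.frame))
colnames(train.results) <- c("TC", "LR", "n.trees", "AUC", "cv.AUC", "dev", "cv.dev")
```

The same one-liner works on the real lists collected with mget(train.all).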
I'm still trying to figure out how to do this myself, and you've already gotten further than I have! One thing that occurred to me is that the problem may be the nested %do% loops. As a test, why not try %dopar% on just the j loop, or see if you can collapse the j and k matrices into a single vector (perhaps a list containing the permutations of both) to pass to gbm.step? For example:
tree.complexity = i[1],
learning.rate = i[2],
Let me know if you get it working!
Edit: also, another potential route, from here, is %:%.
foreach(tree.com = 1:5) %:% foreach(learn = c(0.01,0.025,0.005,0.0025,0.001)) %dopar% {
gbm.step ... return(list(...))}
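Spelled out, that %:% pattern might look like the sketch below. To keep it runnable I've swapped gbm.step for a placeholder function (dummy.fit, my own invention) that just echoes its arguments; in the real version you would return the gbm.step statistics in the data frame instead:

```r
library(foreach)
library(doParallel)

# Placeholder for gbm.step: returns the values you'd extract from each model
dummy.fit <- function(tree.com, learn) {
  data.frame(tree.complexity = tree.com, learning.rate = learn,
             score = tree.com * learn)  # stand-in for AUC/deviance metrics
}

cl <- makeCluster(2)
registerDoParallel(cl)

# %:% nests the two loops so %dopar% sees one flat stream of tasks;
# rbind-ing at both levels yields one row per parameter combination
grid_res <- foreach(tree.com = 1:5, .combine = rbind) %:%
  foreach(learn = c(0.01, 0.025, 0.005, 0.0025, 0.001),
          .combine = rbind) %dopar% {
    dummy.fit(tree.com, learn)
  }

stopCluster(cl)
nrow(grid_res)  # 25: every tree.com/learn combination
```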
If you add tree.com and learn to the returned list, it should output a nice matrix of those values as well. Another option:
foreach(tree.com = 1:5, learn = c(0.01,0.025,0.005,0.0025,0.001)) %dopar% {
gbm.step ... return(list(...))}
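One caveat with this last form (worth checking against the foreach documentation): giving foreach two vectors iterates them in lockstep, like mapply, so it covers only 5 pairs rather than all 25 combinations. Enumerating the full grid with expand.grid first avoids that:

```r
library(foreach)

# Two arguments to one foreach() are paired element-wise...
zipped <- foreach(tree.com = 1:5,
                  learn = c(0.01, 0.025, 0.005, 0.0025, 0.001)) %do% {
  c(tree.com, learn)
}
length(zipped)  # 5 iterations, not 25

# ...so build every combination explicitly instead
hyper_grid <- expand.grid(tree.complexity = 1:5,
                          learning.rate = c(0.01, 0.025, 0.005, 0.0025, 0.001))
nrow(hyper_grid)  # 25
```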
I've done this using some of your code, and also drawing on this post:
p_load(gbm, dismo, TeachingDemos, foreach, doParallel, data.table, dplyr)
#Create grid of hyperparameter and variable combinations
hyper_grid <- expand.grid(learning.rate = c(0.00005,
0.00001,
0.0005,
0.0001,
0.005,
0.001,
0.05,
0.01,
0.5,
0.1),
tree.complexity = seq(1, 3, 1),
bag.fraction = seq(0.2, 0.8, 0.05)
)
#Set up cluster to run in parallel
ncores <- detectCores()
cl <- makeCluster(ncores, outfile="FullGBMGridSearchListening.txt")
registerDoParallel(cl)
#Run grid search. Be warned, depending on how granular your grid is, this can take a very long time to run despite the higher efficiency of parallelisation.
system.time(hyper_grid_res <- foreach (i=1:nrow(hyper_grid),.packages=c('gbm','dismo','TeachingDemos','data.table'),.combine = rbind)
%dopar% {
char2seed("reproducibility", set = TRUE)#sets the seed for reproducibility
# train models
gbm.tune <- gbm.step(data=mod.data,
gbm.x = 3:13,
gbm.y = 2,
family = "bernoulli",
tree.complexity = hyper_grid$tree.complexity[i],
learning.rate = hyper_grid$learning.rate[i],
bag.fraction = hyper_grid$bag.fraction[i],
max.trees = 20000,
verbose = FALSE,
silent = TRUE,
plot.main = FALSE)
#Extract the info we need for model ranking
tree.complexity <- hyper_grid$tree.complexity[i]
learning.rate <- hyper_grid$learning.rate[i]
bag.fraction <- hyper_grid$bag.fraction[i]
n.trees <- gbm.tune$n.trees
CV_correlation <- gbm.tune$cv.statistics$correlation.mean
CV_deviance_explained <- (((gbm.tune$self.statistics$mean.null- gbm.tune$cv.statistics$deviance.mean)/gbm.tune$self.statistics$mean.null)*100)
print(i)#Keep track of progress in listener
#Combine desired outputs in a data.table
data.table(tree.complexity = tree.complexity,
learning.rate = learning.rate,
bag.fraction = bag.fraction,
n.trees = n.trees,
CV_correlation = CV_correlation,
CV_deviance_explained = CV_deviance_explained)
}
)
stopCluster(cl)
#Order by deviance explained > n.trees > tree.complexity and place into a new object (I recommend saving this as .csv for subsequent use)
Full_BRT_parallel_grid.search.results <- hyper_grid_res %>% dplyr::arrange(desc(CV_deviance_explained),desc(n.trees),tree.complexity)