我正在尝试为数据集上所有可能的因素组合拟合线性混合模型。由于运行时间太长,我真的需要优化我的代码。
为了说明我在做什么,我们以
sleepstudy
包中的 lme4
数据为例:
# Load required packages
library(lmerTest)
library(data.table)
library(emmeans)
library(multcomp)
# Prepare example data
sleepstudy <- data.table(sleepstudy)
sleepstudy[, Subject := as.character(Subject)]
sleepstudy
包含 18 名受试者,在不同日期重复测量反应时间。为了简单起见,我将仅使用前 9 个主题:
# Get unique subjects - use only the 9 first subjects to prevent code from taking forever to run
unique_subjects <- unique(sleepstudy$Subject)
unique_subjects <- unique_subjects[1:9]
接下来,我使用上一步中选择的 9 个主题的所有可能组合创建一个向量。因此,我必须在 502 个独特的主题组合上拟合混合线性模型:
# Generate all possible combinations of subjects
combinations <- lapply(2:length(unique_subjects), function(n) combn(unique_subjects, n, simplify = TRUE))
combinations <- do.call(c, lapply(combinations, asplit, 2))
最后,对于 502 个主题组合中的每一个,我都会执行以下操作: - 从原始表中提取当前组合的子集; - 拟合线性模型; - 估计组平均值并准备一个结果表,该表在用字母 a、b、c 以及 ab、bc 等组合表示的受试者之间具有统计显着性
# Fit models for each combination
res <- lapply(1:length(combinations), function(i) {
# subset table for specific subject combination
subset_dt <- sleepstudy[Subject %in% combinations[[i]]]
# fit model
lmer_model <- lmer(Reaction ~ Subject + (1|Days), data = subset_dt)
# create table including letter display of all pair-wise comparisons
lw.comp <- data.table(cld(emmeans(lmer_model, pairwise ~ Subject),
level=0.95, Letters=LETTERS))
# create new column with list of subjects taken into account in the model
setcolorder(lw.comp[, list_subjects := paste(combinations[[i]], collapse=",")], c('list_subjects'))
return(lw.comp)
})
# combine results
(comb.dt <- rbindlist(res))
但是,运行时间太长。也许在这个玩具示例中没有那么多,但我的真实数据集总共有超过 50k 个组合。因此,我需要找到一些方法来优化这段代码。
是否有任何本机
data-table
解决方案可以“内部”运行此解决方案,而无需使用 lapply 对每个组合进行子集化?
我不能对他的特定代码使用任何
mclapply
,因为我已经使用它来并行化我必须从中提取主题组合的所有数据文件的外循环。
tl;dr 我有一些技巧可以让
lme4
little 更快一点,但是:
emmeans
调用占用的,而不是由 lmer
占用。
5e4*150*0.001/60
=125 分钟。对于实质性数据分析来说,这真的太长了吗? (同样,您的实际示例的个体拟合可能比这更大/更慢......)
refit()
选项使用不同的响应变量向量重新拟合模型。这仅在这种情况下有效,因为有一个非常具体的结构,即固定效应协变量的向量对于每个主题都是相同的(即设计是平衡的),并且主题是“可交换的”。另请注意,在这种情况下,我们必须传递一个与我们要重新拟合的子集具有相同数据维度的基本模型,因此我们必须为具有特定属性的每组组合执行单独的循环(或嵌套循环)。科目数量...仅返回系数,在
emmeans
ff1 <- function(i, start = NULL, base_model = NULL, refit = FALSE, coefs_only = FALSE) {
## subset table for specific subject combination
subset_dt <- sleepstudy[Subject %in% combinations[[i]]]
## fit model
if (!refit) {
lmer_model <- update(base_model, data = subset_dt, start = start)
} else {
lmer_model <- refit(base_model, newresp = subset_dt$Reaction)
}
if (coefs_only) return(fixef(lmer_model))
## create table including letter display of all pair-wise comparisons
lw.comp <- data.table(cld(emmeans(lmer_model, pairwise ~ Subject),
level=0.95, Letters=LETTERS))
## create new column with list of subjects taken into account in the model
setcolorder(lw.comp[, list_subjects := paste(combinations[[i]], collapse=",")], c('list_subjects'))
return(lw.comp)
}
fit base model
lmer1 <- lmer(Reaction ~ Subject + (1|Days),
data = subset(sleepstudy, Subject %in% combinations[[1]]))
##
library(microbenchmark)
microbenchmark(baseline = ff1(1, base_model = lmer1),
good_start = ff1(1, base_model = lmer1, start = getME(lmer_base,"theta")),
refit = ff1(1, base_model = lmer1, refit = TRUE),
coefs_only = ff1(1, base_model = lmer1, refit = TRUE, coefs_only = TRUE)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
baseline 71.355103 73.124095 75.752385 74.771380 77.15065 93.44492 100 a
good_start 69.842636 72.537420 75.942336 74.816134 77.87323 91.91398 100 a
refit 55.134464 56.712845 58.763176 58.028897 60.14157 75.98236 100 b
coefs_only 6.284897 6.937626 7.233845 7.142845 7.38023 11.42668 100 c
目前的结论:
设置起始值并没有多大帮助(在这种情况下);
refit
仅返回系数(跳过
emmeans
## refit base model to last combination (all 9 subjects)
lmer2 <- update(lmer_base,
data = subset(sleepstudy, Subject %in% combinations[[502]]))
microbenchmark(baseline = ff1(502, base_model = lmer2),
good_start = ff1(502, base_model = lmer2, start = getME(lmer2,"theta")),
refit = ff1(502, base_model = lmer2, refit = TRUE),
coefs_only = ff1(502, base_model = lmer2, refit = TRUE, coefs_only = TRUE)
)
Unit: milliseconds
expr min lq mean median uq max neval cld
baseline 146.370169 150.685938 157.295938 154.291821 157.025863 322.54636 100 a
good_start 145.514585 151.198449 153.941184 154.273386 156.794318 168.84127 100 a
refit 127.749710 132.707177 136.826251 137.163430 140.308683 152.99645 100 b
coefs_only 6.386507 7.239992 7.515005 7.456434 7.604491 12.60484 100 c
需要基线两倍的时间,但所有这些都来自 emmeans
计算 — 如果我们跳过它(使用 coefs_only
),我们将回到 7.5 毫秒计时!
剖析我最初通过分析代码意识到
emmeans
source()
...
Rprof("lmerprof.out")
x <- replicate(100, ff1(1, refit_model = lmer_base))
Rprof(NULL)
source("https://raw.githubusercontent.com/noamross/noamtools/master/R/proftable.R")
proftable("lmerprof.out", 20)
PctTime Call
11.604 data.table > cld > emmeans > do.call > <Anonymous> > try > tryCatch > tryCatchList > tryCatchOne > doTryCatch > recover_data > .get.outside.method > utils::getAnywhere > find > %in% > ls
8.874 data.table > cld > emmeans > do.call > <Anonymous> > emm_basis > .get.outside.method > utils::getAnywhere > find > %in% > ls
8.874 data.table > cld > emmeans > do.call > <Anonymous> > emm_basis > utils::getAnywhere > find > %in% > ls
5.461 data.table > cld > emmeans > do.call > <Anonymous> > try > tryCatch > tryCatchList > tryCatchOne > doTryCatch > recover_data > utils::getAnywhere > find > %in% > ls
4.096 data.table > cld > emmeans > do.call > <Anonymous> > try > tryCatch > tryCatchList > tryCatchOne > doTryCatch > recover_data > recover_data.merMod > utils::getAnywhere > find > %in% > ls
0.683 data.table > cld > emmeans > apply > fac.reduce
0.683 data.table > cld > emmeans > do.call > <Anonymous>
0.683 data.table > cld > emmeans > do.call > <Anonymous> > emm_basis > emm_basis.merMod > pbkrtest::vcovAdj.lmerMod > vcovAdj_internal > * > new > initialize > callNextMethod > .nextMethod > validObject >
0.683 data.table > cld > emmeans > do.call > <Anonymous> > emm_basis > emm_basis.merMod > pbkrtest::vcovAdj.lmerMod > vcovAdj_internal > + > new > initialize > callNextMethod > .nextMethod > validObject >
0.683 data.table > cld > emmeans > do.call > <Anonymous> > try > tryCatch > tryCatchList > tryCatchOne > doTryCatch > recover_data > recover_data.merMod > .has.fcns > setdiff > as.vector > .all.vars > gsu
0.683 refit > refit.merMod > <Anonymous> > do.call > methods::new > initialize
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > .mcletters > multcompView::multcompLetters > vec2mat2 > t > as.matrix > as.matrix.data.frame
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > .mcletters > multcompView::multcompLetters > vec2mat2 > t > as.matrix > as.matrix.data.frame > vapply
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > .mcletters > unlist > substr
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > .mcletters > unlist > match.fun
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > <Anonymous> > getClass > getClassDef > .getClassesFromCache
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > contrast > contrast.emmGrid > t > t.data.frame > as.matrix > as.matrix.data.frame > as.list
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > do.call > <Anonymous> > test.emmGrid > summary > summary.emmGrid > .est.se.df > t > apply > pbkrtest::Lb_ddf > [ > .subscript.2ary > .
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > do.call > <Anonymous> > test.emmGrid > summary > summary.emmGrid > .est.se.df > t > apply > pbkrtest::Lb_ddf > * > .Ops.checkDimNames
0.341 data.table > cld > cld.emm_list > multcomp::cld > cld.emmGrid > do.call > <Anonymous> > test.emmGrid > summary > summary.emmGrid > .est.se.df > t > apply > pbkrtest::Lb_ddf > * > new > initialize >
Parent Call: replicate > sapply > lapply > FUN > ff1 > ...
Total Time: 5.86 seconds
Percent of run time represented: 46.1 %>