Why do ggplot2 95% CIs and manually calculated prediction 95% CIs differ?

Question (votes: 3, answers: 1)

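I would like to know why, when calculating 95% confidence bands from a linear mixed-effects model, ggplot2 produces narrower bands than a manual calculation does, e.g. one following Ben Bolker's method here: confidence intervals on predictions. In other words, is ggplot2 giving an inaccurate representation of the model?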

Here is a reproducible example using the sleepstudy dataset (modified to be structurally similar to a data frame I am working with):

library(lme4)    # provides lmer() and the sleepstudy dataset
library(ggplot2) # plotting
data("sleepstudy") # load dataset
height <- seq(165, 185, length.out = 18) # create vector called height
Treatment <- rep(c("Control", "Drug"), 9) # create vector called treatment
Subject <- levels(sleepstudy$Subject) # get vector of Subject
ht.subject <- data.frame(height, Subject, Treatment) 
sleepstudy <- dplyr::left_join(sleepstudy, ht.subject, by="Subject") # Append df so that each subject has its own height and treatment
sleepstudy$Treatment <- as.factor(sleepstudy$Treatment)

Fit the model, add the predictions to the original data frame, and plot:

m.sleep <- lmer(Reaction ~ Treatment*height + (1 + Days|Subject), data=sleepstudy)
sleepstudy$pred <- predict(m.sleep)
ggplot(sleepstudy, aes(height, pred, col=Treatment)) + geom_smooth(method="lm")

Calculate the confidence intervals following Bolker's method:

newdf <- expand.grid(height=seq(165, 185, 1),
                   Treatment=c("Control","Drug"))
newdf$Reaction <- predict(m.sleep, newdf, re.form=NA) 
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1+VarCorr(m.sleep)$Subject[1]
cmult <- 1.96

newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))

# plot confidence intervals
ggplot(newdf, aes(x=height, y=Reaction, colour=Treatment)) + 
geom_point() +
geom_ribbon(aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4)

Tags: r, ggplot2, lme4, mixed-models, confidence-interval

1 Answer (2 votes)

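With a couple of tweaks, this seems consistent. The confidence intervals are indeed larger, but not enormously larger. Keep in mind that ggplot is fitting a very different model: it fits separate linear (not linear mixed) models for each treatment, ignoring (1) the repeated measures and (2) the effect of Days.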
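
To illustrate why ignoring repeated measures tends to narrow the band, here is a minimal base-R sketch. It does not use the sleepstudy data: the data are simulated, and every number in it (18 subjects, 10 observations each, the standard deviations) is an assumption chosen for illustration. It fits a plain lm() to one treatment group twice, once treating every row as independent (as far as I know, this is essentially what geom_smooth(method="lm") does within each colour group) and once on one mean per subject, a crude stand-in for respecting the grouping rather than a substitute for the mixed model.

```r
# Simulated stand-in for the modified sleepstudy data (all values are made up)
set.seed(1)
n_subj <- 18
n_obs  <- 10
subj <- data.frame(
  Subject   = factor(seq_len(n_subj)),
  height    = seq(165, 185, length.out = n_subj),
  Treatment = rep(c("Control", "Drug"), n_subj / 2),
  b0        = rnorm(n_subj, sd = 30)  # subject-level random intercept
)
dat <- subj[rep(seq_len(n_subj), each = n_obs), ]
dat$Reaction <- 300 + 0.5 * (dat$height - 175) + dat$b0 +
  rnorm(nrow(dat), sd = 10)           # within-subject noise

# Mean half-width of the 95% confidence band over a grid of heights
ci_halfwidth <- function(d, grid = data.frame(height = 165:185)) {
  fit <- lm(Reaction ~ height, data = d)
  ci  <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
  mean(ci[, "upr"] - ci[, "lwr"]) / 2
}

ctrl <- dat[dat$Treatment == "Control", ]

# (a) every row treated as an independent observation
w_naive <- ci_halfwidth(ctrl)

# (b) one row per subject: the subject means
ctrl_means <- aggregate(Reaction ~ Subject + height, data = ctrl, FUN = mean)
w_means <- ci_halfwidth(ctrl_means)

print(c(naive = w_naive, subject_means = w_means))
```

In this simulation the naive band is much narrower, because 90 correlated rows are counted as if they were 90 independent pieces of information about the height slope, while there are really only 9 subjects per group.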

It seems odd to fit a model with random slopes but no population-level slope (e.g. see here), so I added a fixed effect of Days:

m.sleep <- lmer(Reaction ~ Treatment*height + Days +
                (1 + Days|Subject),
                data=sleepstudy)

I reorganized the plotting code a little:

theme_set(theme_bw())
gg0 <- ggplot(sleepstudy, aes(height, colour=Treatment)) +
    geom_point(aes(y=Reaction))+
    geom_smooth(aes(y=pred), method="lm")
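
  • If you want to compute confidence intervals (which is what is comparable to what lm() / ggplot2 is doing), then you probably should not add VarCorr(m.sleep)$Subject[1] to the variance (the tvar1 variable from the FAQ example is for creating prediction intervals rather than confidence intervals ...).
  • Because I have Days in the model above, I added mean(sleepstudy$Days) to the prediction data frame.
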
newdf <- expand.grid(height=seq(165, 185, 1),
                     Treatment=c("Control","Drug"),
                     Days=mean(sleepstudy$Days))
newdf$Reaction <- newdf$pred <- predict(m.sleep, newdf, re.form=NA) 
modmat <- model.matrix(terms(m.sleep), newdf)
pvar1 <- diag(modmat %*% tcrossprod(vcov(m.sleep), modmat))
tvar1 <- pvar1
cmult <- 1.96

newdf <- data.frame(newdf
,plo = newdf$Reaction-cmult*sqrt(pvar1)
,phi = newdf$Reaction+cmult*sqrt(pvar1)
,tlo = newdf$Reaction-cmult*sqrt(tvar1)
,thi = newdf$Reaction+cmult*sqrt(tvar1))

gg0 + 
    geom_point(data=newdf,aes(y=Reaction)) +
    geom_ribbon(data=newdf,
                aes(ymin=plo, ymax=phi, fill=Treatment), alpha=0.4,
                colour=NA)
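
The distinction between the two variances can be checked on a simpler model. The sketch below is an analogy, not the mixed-model calculation itself: it uses a plain lm() on simulated data (all values are made up), where the residual variance plays the role that the added variance component plays in tvar1. It shows that diag(X V X') alone reproduces the confidence interval from predict(), while adding the extra variance reproduces the prediction interval. One difference from the code above: it uses a t quantile for the multiplier instead of 1.96 so that the results match predict() exactly.

```r
# Simulated data; the numbers are assumptions chosen for illustration
set.seed(42)
d <- data.frame(height = runif(60, 165, 185))
d$Reaction <- 300 + 0.5 * (d$height - 175) + rnorm(60, sd = 20)
fit <- lm(Reaction ~ height, data = d)

newdf  <- data.frame(height = seq(165, 185, 5))
modmat <- model.matrix(delete.response(terms(fit)), newdf)
mu     <- drop(modmat %*% coef(fit))

# Variance of the fitted mean only: basis of a confidence interval
pvar <- diag(modmat %*% tcrossprod(vcov(fit), modmat))
# Add the variance of a new observation: basis of a prediction interval
tvar <- pvar + sigma(fit)^2

cmult <- qt(0.975, df.residual(fit))  # t multiplier instead of 1.96
manual_ci <- cbind(lwr = mu - cmult * sqrt(pvar), upr = mu + cmult * sqrt(pvar))
manual_pi <- cbind(lwr = mu - cmult * sqrt(tvar), upr = mu + cmult * sqrt(tvar))

# Reference intervals from predict()
conf_int <- predict(fit, newdf, interval = "confidence")[, c("lwr", "upr")]
pred_int <- predict(fit, newdf, interval = "prediction")[, c("lwr", "upr")]

print(all.equal(unname(manual_ci), unname(conf_int)))
print(all.equal(unname(manual_pi), unname(pred_int)))
```

The band built from tvar is always the wider of the two, so comparing it with a ggplot2 ribbon (a confidence band) overstates the disagreement.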


Comparing with the estimated slopes and standard errors:

m0 <- lm(Reaction~height*Treatment,sleepstudy)
ff <- function(m) {
    print(coef(summary(m))[-1,c("Estimate","Std. Error")],digits=2)
}

ff(m0)
##                      Estimate Std. Error
## height                   -0.3       0.94
## TreatmentDrug          -602.2     234.01
## height:TreatmentDrug      3.5       1.34

ff(m.sleep)
##                      Estimate Std. Error
## TreatmentDrug          -55.03      425.3
## height                   0.41        1.7
## Days                    10.47        1.5
## TreatmentDrug:height     0.33        2.4

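This looks consistent/about right: the mixed model gives larger standard errors for the slope with respect to height and for the height:Treatment interaction. (The main effects of TreatmentDrug look crazy because they are the expected effects of treatment at height==0 ...)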
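
A small sketch of the height==0 point, again on simulated data with a plain lm() (all values are assumptions, not the sleepstudy fit): centering height moves the reference point for the TreatmentDrug main effect from height 0 to the mean height, which changes that estimate and its standard error but leaves the interaction untouched. The variable name height_c is introduced here purely for illustration.

```r
# Simulated data; the numbers are assumptions chosen for illustration
set.seed(7)
n <- 90
d <- data.frame(
  height    = runif(n, 165, 185),
  Treatment = factor(rep(c("Control", "Drug"), length.out = n))
)
d$Reaction <- 300 + 0.5 * (d$height - 175) +
  ifelse(d$Treatment == "Drug", 10, 0) + rnorm(n, sd = 20)

# Raw height: the TreatmentDrug coefficient is the effect at height == 0
m_raw <- lm(Reaction ~ Treatment * height, data = d)

# Centered height: the TreatmentDrug coefficient is the effect at mean height
d$height_c <- d$height - mean(d$height)
m_ctr <- lm(Reaction ~ Treatment * height_c, data = d)

est_se <- function(m, term) {
  unname(coef(summary(m))[term, c("Estimate", "Std. Error")])
}
res <- rbind(raw      = est_se(m_raw, "TreatmentDrug"),
             centered = est_se(m_ctr, "TreatmentDrug"))
colnames(res) <- c("Estimate", "Std. Error")
print(res)
```

The two fits describe the same model, so the fitted lines and the interaction coefficient are identical; only the interpretation of the main effect changes.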


As a cross-check, I can get similar answers from sjPlot::plot_model() ...

library(sjPlot)
plot_model(m.sleep, type="pred", terms=c("height","Treatment"))

