随机拦截GLM

Question

我想在R中拟合一个随机截距互补对数回归，以检查未观察到的用户异质性.我通过互联网和书籍搜索，只在Stata中找到了一个解决方案，也许有人可以将其改编为R.在Stata中，有2个命令可用。

xtcloglog 在Stata中，有两个命令：
gllamm 对于随机系数模型和更高层次的模型

我的数据关系到人们的活动是否完成，是否受到阳光的影响--。completion 是结果变量，而 sunshine 和下面提到的其他变量将是解释变量，这是一个简化版本。

    581755 obs. of 68 variables:
     $ activity          : int  37033 37033 37033 37033 37033 37033 37033 37033 37033 37033 ...
     $ people         : int  5272 5272 5272 5272 5272 5272 5272 5272 5272 5272 ...
     $ completion: num 0 0 0 0 0 0 0 0 0 0 ...
     $ active            : int  0 2 2 2 2 2 2 2 2 2 ...
     $ overdue           : int  0 0 0 0 0 0 0 0 0 0 ...
     $ wdsp              : num  5.7 5.7 7.7 6.4 3.9 5.8 3.5 6.3 4.8 9.4 ...
     $ rain              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ UserCompletionRate: num [1:581755, 1] NaN -1.55 -1.55 -1.55 -1.55 ...
      ..- attr(*, "scaled:center")= num 0.462
      ..- attr(*, "scaled:scale")= num 0.298
     $ DayofWeekSu       : num  0 0 0 0 0 1 0 0 0 0 ...
     $ DayofWeekMo       : num  0 0 0 0 0 0 1 0 0 0 ...
     $ DayofWeekTu       : num  1 0 0 0 0 0 0 1 0 0 ...
     $ DayofWeekWe       : num  0 1 0 0 0 0 0 0 1 0 ...
     $ DayofWeekTh       : num  0 0 1 0 0 0 0 0 0 1 ...
     $ DayofWeekFr       : num  0 0 0 1 0 0 0 0 0 0 ...

     $ MonthofYearJan    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearFeb    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMar    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearApr    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearMay    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearJun    : num  1 1 1 1 1 1 1 1 1 1 ...
     $ MonthofYearJul    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearAug    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearSep    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearOct    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ MonthofYearNov    : num  0 0 0 0 0 0 0 0 0 0 ...
     $ cold              : num  0 0 0 0 0 0 0 0 0 0 ...
     $ hot               : num  0 0 0 0 0 0 0 0 0 0 ...
     $ overduetask       : num  0 0 0 0 0 0 0 0 0 0 ...

原始（简化）数据。

 df <-  people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)

到目前为止，我已经为cloglog使用了这段代码。

model <- as.formula("completion ~  sunshine")
clog_full = glm(model,data=df,family = binomial(link = cloglog))
summary(clog_full)

使用包glmmML:

model_re <- as.formula("completion ~  sunshine")
clog_re = glmmML(model_re,cluster = people, data= df,
    family = binomial(link = cloglog)) 
summary(clog_re)

使用包lme4:

model_re1<- as.formula("completion ~  (1|people) + sunshine") 
clog_re1 <- glmer(model_re1, data=df,
   family = binomial(link = cloglog)) 
summary(clog_re1)

然而，R没有得到任何结果，只是运行它们，但从来没有得到结果。我是否必须使用people或activities作为一个集群？

如果有谁，也有一个想法，如何运行这个模型与固定截距，我很高兴知道。

Answer 1

这个对我来说很好用（差不多：见下面的注释）。

## added data.frame()
df <-  data.frame(people = c(1,1,1,2,2,3,3,4,4,5,5),
        activity = c(1,1,1,2,2,3,4,5,5,6,6),
        completion = c(0,0,1,0,1,1,1,0,1,0,1),
        sunshine = c(1,2,3,4,5,4,6,2,4,8,4)
        )

model_re1 <- completion ~  (1|people) + sunshine
clog_re1 <- glmer(model_re1, data=df,
                  family = binomial(link = cloglog))

这很快就结束了（不到一秒钟）：也许你忘了关闭引号或括号或其他东西......？
然而，它确实产生了一个消息 "边界（奇异）拟合：见?isSingular"，这是因为你的数据集太小了，以至于人与人之间差异的最佳估计为零（因为它不可能是负的）。

更新 "边界（奇异）拟合度"：见 "isSingular"，这是因为你的数据集非常小众，人与人之间的最佳估计值为零（因为可以是负值）。: 我很抱歉地告诉你，混合模型（GLMMs）比标准GLMs的计算量大得多。50万个观测值和68个预测变量绝对是个大问题，你应该预计拟合需要几个小时。我有一些建议。

你一定要尝试一下你的数据子集（包括观测值和预测变量），以了解计算时间的伸缩性
的 glmmTMB 据我所知，R包是解决这个问题最快的选择(lme4 将会对大量的预测变量进行严重的扩展），但可能会遇到内存限制。在 MixedModels.jl Julia包可能更快。
你通常可以打开 "verbose "或 "tracing "选项，让你知道模型至少在解决问题，而不是完全卡住了（要知道需要多长时间才能完成并不可行，但至少你知道有些事情正在发生......）。
如果Stata的速度更快（我怀疑，但有可能），你可以使用它

下面是一个有10000个观测值和一个预测变量的例子。

n <- 10000
set.seed(101)
dd <- data.frame(people=factor(rep(1:10000,each=3)),
                 sunshine=sample(1:8,size=n, replace=TRUE))
dd$completion <- simulate(~(1|people)+sunshine,
                          newdata=dd,
                          family=binomial(link="cloglog"),
                          newparams=list(beta=c(0,1),
                                         theta=1))[[1]]

glmer 运行了80秒，然后失败。

system.time(update(clog_re1, data=dd, verbose=100))

另一方面： glmmTMB 在20秒内解决这个问题（我的电脑上有8个核心，而且 glmmTMB 使用了所有的核心，所以这个任务的CPU分配上升到750%；如果你的核心数较少，计算时间也会相应增加）。)

library(glmmTMB)
system.time(glmmTMB(model_re1,data=dd,family = binomial(link = cloglog),
                    verbose=TRUE))

如果我再增加4个预测变量（共5个），计算时间就会增加到46秒（同样使用8个内核：所有内核的总计算时间为320秒）。由于预测变量的数量是原来的13倍，观测值的数量是原来的50倍，你肯定应该预料到这是一个具有挑战性的计算。

A 很对异质性的粗略评估是拟合同质模型，并将残差偏差（或皮尔逊残差平方和）与模型的残差自由度进行比较；如果前者大得多，那就证明存在某种形式的误拟（异质性或其他）。

随机拦截GLM

问题描述投票：0回答：1

1个回答

最新问题

随机拦截GLM

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1