混合效应逻辑回归

Question

我正在尝试在 python 中实现混合效应逻辑回归。作为比较，我使用 R 中

glmer

包中的

lme4

函数。

我发现

statsmodels

模块有一个

BinomialBayesMixedGLM

应该能够适合这样的模型。但是，我遇到了很多问题：

我发现
```
statsmodels
```
函数的文档并不完全有帮助或不清楚，所以我不完全确定如何正确使用该函数。
到目前为止，我的尝试尚未产生复制我在 R 中使用
```
glmer
```
拟合模型时得到的结果。
我预计
```
BinomialBayesMixedGLM
```
函数不会计算 p 值，因为它是贝叶斯函数，但我似乎无法弄清楚如何访问参数的完整后验分布。

作为测试用例，我使用此处提供的泰坦尼克号数据集。

import os
import pandas as pd
import statsmodels.genmod.bayes_mixed_glm as smgb

titanic = pd.read_csv(os.path.join(os.getcwd(), 'titanic.csv'))

r = {"Pclass": '0 + Pclass'}
mod = smgb.BinomialBayesMixedGLM.from_formula('Survived ~ Age', r, titanic)
fit = mod.fit_map()
fit.summary()

#           Type    Post. Mean  Post. SD       SD SD (LB) SD (UB)
# Intercept M           3.1623    0.3616            
# Age       M          -0.0380    0.0061            
# Pclass    V           0.0754    0.5669    1.078   0.347   3.351

但是，除了年龄的斜率之外，这似乎与我在 R 中使用

glmer(Survived ~ Age + (1 | Pclass), data = titanic, family = "binomial")

得到的结果不匹配：

Random effects:
 Groups Name        Variance Std.Dev.
 Pclass (Intercept) 0.8563   0.9254  
Number of obs: 887, groups:  Pclass, 3

Fixed effects:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.961780   0.573402   1.677   0.0935 .  
Age         -0.038708   0.006243  -6.200 5.65e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

那么在 python 中创建模型时我犯了什么错误？而且，一旦解决了这个问题，我如何提取后验或 p 值？最后，他们的混合效应逻辑回归的 Python 实现是否更类似于 R 中的实现？

Answer 1

只需按照评论中的建议，与

Python

做类似的事情

Pymer4

似乎提供了一种合适的方法（特别是如果您无论如何都熟悉

）。使用问题中提到的示例数据集“泰坦尼克号”：

from pymer4.models import Lmer

model = Lmer("Survived  ~ Age  + (1|Pclass)",
             data=titanic, family = 'binomial')

print(model.fit())

输出：

Formula: Survived~Age+(1|Pclass)

Family: binomial     Inference: parametric

Number of observations: 887  Groups: {'Pclass': 3.0}

Log-likelihood: -525.812     AIC: 1057.624

Random effects:

               Name    Var    Std
Pclass  (Intercept)  0.856  0.925

No random effect correlations specified

Fixed effects:

             Estimate  2.5_ci  97.5_ci     SE     OR  OR_2.5_ci  OR_97.5_ci  \
(Intercept)     0.962  -0.162    2.086  0.573  2.616       0.85       8.050   
Age            -0.039  -0.051   -0.026  0.006  0.962       0.95       0.974   

              Prob  Prob_2.5_ci  Prob_97.5_ci  Z-stat  P-val  Sig  
(Intercept)  0.723        0.460         0.889   1.677  0.093    .  
Age          0.490        0.487         0.493  -6.200  0.000  ***

作为附加评论（抱歉偏离了主要问题），我在带有

Ubuntu 20.04

的

Python 3.8.8

机器上运行了这个。不确定其他人是否遇到了这个问题，但是当使用

Pymer4

运行上面的模型时，包抛出了一个错误（当我尝试从

Pymer4

文档这里重现类似的模型时，出现同样的错误）：

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

这可以通过更新 pymer4 版本来修复（另请参阅此相关问题）：

conda install -c ejolly -c conda-forge -c defaults pymer4=0.7.8

Answer 2

差异可能源于不同的算法：

glmer

使用所涉及积分的近似评估（高斯-埃尔米特求积），而

BinomialBayesMixedGLM

使用变分贝叶斯估计。相关的 Statsmodel 页面提供了一个链接，指向一篇讨论不同评估方法之间差异的文章。在 Statsmodels 中使用

.fit_map()

（拉普拉斯近似）与

.fit_vb()

（变分贝叶斯）时也可以看到差异 - 两者都不同于

glmer

使用的方法。

IMO OP中给出的代码是正确的，尽管我得到了一些不同的结果：

from __future__ import division
import numpy as np
import pandas as pd
import statsmodels.api as sm

titanic = pd.read_csv("titanic.csv")
r = {"Pclass": '0 + C(Pclass)'}
glmm = sm.BinomialBayesMixedGLM.from_formula('Survived~Age', r, data=titanic)
res = glmm.fit_vb()
print(res.summary())

导致

               Binomial Mixed GLM Results
========================================================
          Type Post. Mean Post. SD   SD  SD (LB) SD (UB)
--------------------------------------------------------
Intercept    M     0.8947   0.0747                      
Age          M    -0.0386   0.0023                      
Pclass       V     0.0565   0.3750 1.058   0.500   2.240
========================================================
Parameter types are mean structure (M) and variance
structure (V)
Variance parameters are modeled as log standard
deviations

无需借助公式即可获得相同的结果，这也许可以提供更多的清晰度（并且希望对像我这样第一次遇到 Statsmodels 的人有所帮助 - 就像我发现文档不太透明的 OP）非统计学家）：

titanic['Intercept'] = 1.
exog_vc = titanic.apply(lambda x: [x.Pclass==j for j in range(1, 4)], axis=1, result_type='expand').astype(int)
glmm = sm.BinomialBayesMixedGLM(titanic['Survived'], titanic[['Intercept', 'Age']], exog_vc, pd.Series([0, 0, 0]))
res1 = glmm.fit_vb()
print(res1.summary())

这里最后一个参数（

ident

）允许放宽对不同组件方差的限制，例如，如果我们采用

ident = pd.Series([0, 1, 2])

，所有三个类都可以具有不同的方差。

最后，可以从结果对象中收集更多信息，特别是分量的均值和方差：

res.params

、

res.vc_mean

、

res.vc_sd

、

res.vcp_mean

、

res.vcp_sd

。请注意，

vcp_mean

实际上是方差的对数，因为它服从对数正态分布 - 因此它可以是负数。

来自文档：

后验分布中存在三种类型的值：固定效应参数（fep），对应于exog的列，随机效应实现（vc），对应于exog_vc的列，以及随机效应实现的标准差（ vcp)，对应ident中唯一的整数标签。

所有随机效应都被建模为独立的高斯值（给定方差结构参数）。 exog_vc 的每一列都有一个独特的实现随机效应，用于形成线性预测变量。 ident 的元素决定了不同的方差结构参数。 ident 中具有相同值的两个随机效应实现具有相同的方差。当使用公式拟合时，ident 是在内部构造的（vc_formulas 的每个元素都会在 ident 中生成一个不同的标签）。

随机效应标准差参数 (vcp) 具有平均值为 0 且标准差为 vcp_p 的对数正态先验分布。

混合效应逻辑回归

问题描述投票：0回答：2

2个回答

最新问题

混合效应逻辑回归

问题描述 投票：0回答：2

2个回答

最新问题

问题描述投票：0回答：2