Is there a function in R to create simulated categorical variables from logistic regression results?


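I can create simulated continuous and binary variables from a set of logistic regression results. However, I am struggling to create a simulated categorical variable from the same kind of results.

For example, using the Titanic dataset that ships with base R, I can create simulated variables for Adult and Male. But I cannot work out how to simulate the categorical variable "Class", which has 4 levels (1st, 2nd, 3rd and Crew). The binary response variable is Survived.

The R code is below.

Any suggestions on how to simulate the "Class" variable would be greatly appreciated.

Thank you very much for your time,

Paul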

library(dplyr)
library(tidyr)
library(sjPlot)
# Step 1 - Load the dataset
data(Titanic)
# Step 2 - Transform the Titanic dataset
mydata <- reshape2::melt(Titanic) %>%
  uncount(value) %>%
  as_tibble()
mydata
# Step 3 - Create the dummy variables
mydata$Crew <- ifelse(mydata$Class == "Crew", 1, 0)
mydata$First <- ifelse(mydata$Class == "1st", 1, 0)
mydata$Second <- ifelse(mydata$Class == "2nd", 1, 0)
mydata$Third <- ifelse(mydata$Class == "3rd", 1, 0)
mydata$Male <- ifelse(mydata$Sex == "Male", 1, 0)
mydata$Female <- ifelse(mydata$Sex == "Female", 1, 0)
mydata$Child <- ifelse(mydata$Age == "Child", 1, 0)
mydata$Adult <- ifelse(mydata$Age == "Adult", 1, 0)
# Step 4 - Fit the logistic regression model
glm.1 <- glm(Survived ~ Second + Third + Crew + Adult + Male, family = binomial("logit"), data = mydata)
# Step 5 - View the model summary
summary(glm.1)
# Step 6 - Create the simulated data
set.seed(016752277)
Male <- sample(c(0,1), size = 2000, replace = TRUE)
Adult <- sample(c(0,1), size = 2000, replace = TRUE)
xb <- 3.1054 + -2.4201*Male + -1.0615*Adult
p <- 1/(1 + exp(-xb))
Survived <- rbinom(n = 2000, size = 1, prob = p)
# Step 7 - Fit the logistic regression model based on the simulated data
glm.1simulated <- glm(Survived ~ Male + Adult, family = "binomial")
# Step 8 - View the model summary based on the simulated data
summary(glm.1simulated)
# Step 9 - View both models side by side
tab_model(glm.1, glm.1simulated)
r simulation logistic-regression categorical-data
1 Answer

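You are perhaps making this more difficult than it needs to be. There is no need to create all those dummy variables, because glm dummy-codes factor predictors for you. Your initial model is as simple as:
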
data(Titanic)

mydata <- tidyr::uncount(as.data.frame(Titanic), Freq)

glm.1 <- glm(Survived ~ ., mydata, family = binomial)

summary(glm.1)
#> 
#> Call:
#> glm(formula = Survived ~ ., family = binomial, data = mydata)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.6853     0.2730   2.510   0.0121 *  
#> Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
#> Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
#> ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
#> SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
#> AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2769.5  on 2200  degrees of freedom
#> Residual deviance: 2210.1  on 2195  degrees of freedom
#> AIC: 2222.1
#> 
#> Number of Fisher Scoring iterations: 4
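
The coefficient names above (Class2nd, Class3rd, ClassCrew) come from R dummy-coding the Class factor for you. As a small sketch, not part of the original answer, you can inspect that coding directly. The base-R row expansion used here is only a stand-in for tidyr::uncount so that the snippet has no package dependencies, and the name X is made up for the example.

```r
# Rebuild one row per person in base R (stand-in for tidyr::uncount)
df <- as.data.frame(Titanic)
mydata <- df[rep(seq_len(nrow(df)), df$Freq),
             c("Class", "Sex", "Age", "Survived")]

# Contrast matrix R uses for the 4-level factor: 3 dummy columns,
# with the first level ("1st") as the reference
contrasts(mydata$Class)

# Columns R builds for a formula that contains Class
X <- model.matrix(~ Class, data = mydata)
colnames(X)
```

With the default treatment contrasts, "1st" is the reference level, so each Class coefficient is a log-odds difference relative to first class.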

For the predictor variables of your simulated data, simply sample from the unique levels of the factors in your original data frame:

set.seed(016752277)

new_data <- data.frame(Class = sample(unique(mydata$Class), 2000, TRUE),
                       Sex   = sample(unique(mydata$Sex), 2000, TRUE),
                       Age   = sample(unique(mydata$Age), 2000, TRUE))
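
Sampling from unique levels makes every level equally likely. If you would prefer the simulated predictors to follow the proportions observed in the original data, one possible variation (an assumption on my part, not something the question asked for) is to pass those proportions to sample. The helper name sample_levels and the seed of 1 are hypothetical choices for this sketch.

```r
# Rebuild one row per person in base R (stand-in for tidyr::uncount)
df <- as.data.frame(Titanic)
mydata <- df[rep(seq_len(nrow(df)), df$Freq),
             c("Class", "Sex", "Age", "Survived")]

# Hypothetical helper: draw n values of a factor using its observed proportions,
# keeping the original level order so predict() sees the same factor coding
sample_levels <- function(x, n) {
  p <- prop.table(table(x))
  factor(sample(names(p), n, replace = TRUE, prob = as.numeric(p)),
         levels = levels(x))
}

set.seed(1)  # arbitrary seed, for reproducibility only

new_data <- data.frame(Class = sample_levels(mydata$Class, 2000),
                       Sex   = sample_levels(mydata$Sex, 2000),
                       Age   = sample_levels(mydata$Age, 2000))

# Simulated proportions of each Class level
prop.table(table(new_data$Class))
```

Note that each variable is still drawn independently, so combinations that never occur in the original data (such as a child crew member) can appear in the simulated data.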

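To get the probabilities implied by the factor levels, you can use predict rather than calculating the log odds by hand and converting them to probabilities: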

probs    <- predict(glm.1, newdata = new_data, type = "response")
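
If you do want to see what predict is doing for a categorical variable, the following sketch (an illustration, not from the original answer) builds the dummy columns with model.matrix, multiplies them by the fitted coefficients, and applies the inverse logit with plogis. The object names grid, X, xb, p_manual and p_predict are made up for this example, and the base-R row expansion stands in for tidyr::uncount.

```r
# Rebuild one row per person in base R (stand-in for tidyr::uncount)
df <- as.data.frame(Titanic)
mydata <- df[rep(seq_len(nrow(df)), df$Freq),
             c("Class", "Sex", "Age", "Survived")]

glm.1 <- glm(Survived ~ ., data = mydata, family = binomial)

# One row per combination of factor levels (4 x 2 x 2 = 16 rows)
grid <- expand.grid(Class = levels(mydata$Class),
                    Sex   = levels(mydata$Sex),
                    Age   = levels(mydata$Age))

# Dummy-coded design matrix, in the same column order as coef(glm.1)
X <- model.matrix(~ Class + Sex + Age, data = grid)

# Linear predictor (log odds), then inverse logit to get probabilities
xb <- drop(X %*% coef(glm.1))
p_manual <- plogis(xb)

# The same probabilities from predict()
p_predict <- predict(glm.1, newdata = grid, type = "response")

all.equal(unname(p_manual), unname(p_predict))
```

This is the categorical analogue of the hand-written xb line in the question: each non-reference level of Class simply contributes its own coefficient when its dummy column is 1.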

Now your simulation is simply:

new_data$Survived <- rbinom(n = 2000, size = 1, prob = probs)

And your simulated model is:

glm.1simulated <- glm(Survived ~ ., new_data, family = binomial)

summary(glm.1simulated)
#> 
#> Call:
#> glm(formula = Survived ~ ., family = binomial, data = new_data)
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.4911     0.1234   3.980 6.90e-05 ***
#> Class2nd     -0.9145     0.1561  -5.857 4.72e-09 ***
#> Class3rd     -1.7217     0.1605 -10.727  < 2e-16 ***
#> ClassCrew    -0.8648     0.1548  -5.588 2.30e-08 ***
#> SexFemale     2.4114     0.1176  20.505  < 2e-16 ***
#> AgeAdult     -0.9605     0.1115  -8.614  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2747.9  on 1999  degrees of freedom
#> Residual deviance: 2080.5  on 1994  degrees of freedom
#> AIC: 2092.5
#> 
#> Number of Fisher Scoring iterations: 4

This compares well with the original model:

sjPlot::tab_model(glm.1, glm.1simulated)
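
If you would rather not depend on sjPlot, a rough base-R alternative is to bind the two coefficient vectors side by side. This sketch reruns the whole workflow so that it is self-contained. The seed of 1 and the name comparison are arbitrary choices for illustration, so the simulated estimates will differ from those shown above.

```r
# Rebuild one row per person in base R (stand-in for tidyr::uncount)
df <- as.data.frame(Titanic)
mydata <- df[rep(seq_len(nrow(df)), df$Freq),
             c("Class", "Sex", "Age", "Survived")]

glm.1 <- glm(Survived ~ ., data = mydata, family = binomial)

set.seed(1)  # arbitrary seed, not the one used in the answer

# Simulate predictors by sampling factor levels, as in the answer
new_data <- data.frame(Class = sample(unique(mydata$Class), 2000, TRUE),
                       Sex   = sample(unique(mydata$Sex), 2000, TRUE),
                       Age   = sample(unique(mydata$Age), 2000, TRUE))

# Simulate the response from the fitted probabilities
probs <- predict(glm.1, newdata = new_data, type = "response")
new_data$Survived <- rbinom(n = 2000, size = 1, prob = probs)

glm.1simulated <- glm(Survived ~ ., data = new_data, family = binomial)

# Coefficients of both models, side by side
comparison <- cbind(original  = coef(glm.1),
                    simulated = coef(glm.1simulated))
round(comparison, 3)
```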

Created on 2023-08-19 with reprex v2.0.2
