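I have code that runs a logistic regression on several selected dependent variables (outcome1 to outcome4). I only want to run a model if a condition on the independent variables is met. Say each combination of outcome and type needs at least two females.
Dummy data: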
set.seed(5)
df <- data.frame(
  id = c(1:100),
  age = sample(20:80, 100, replace = TRUE),
  sex = sample(c("M", "F"), 100, replace = TRUE, prob = c(0.7, 0.3)),
  type = sample(letters[1:4], 100, replace = TRUE),
  outcome1 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.68, 0.32)),
  outcome2 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.65, 0.35)),
  outcome3 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.60, 0.40)),
  outcome4 = sample(c(0L, 1L), 100, replace = TRUE, prob = c(0.45, 0.55)))
Code for looping the GLM (credit: https://stats.idre.ucla.edu/r/codefragments/looping_strings/):
outcomelist <- names(df)[5:8]
modelall <- lapply(outcomelist, function(x) {
  glm(substitute(i ~ type + sex, list(i = as.name(x))), family = "binomial", data = df)
})
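I found plenty of questions about looping, but none with an added condition. I was thinking of subsetting, but I'm no expert and don't know where to put it.
If it's not too much of an extra question, I'd also like each model in the list to be named after its outcome variable (rather than 1 to 4), because otherwise the models will be hard to keep track of once the condition is added.
Any help is appreciated!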
One possibility is to clean the data before running lapply():
df.new <- df
for (ii in seq_along(outcomelist)) {
  temp <- outcomelist[ii]
  # count females in each observed type/outcome combination for outcome ii
  # (aggregate() only returns combinations that actually occur in the data)
  females <- aggregate(df$sex == "F", by = list(df$type, df[[temp]]), FUN = sum)$x
  # TRUE when some combination has fewer than two females
  too_few <- any(females < 2)
  if (too_few) {
    # requirement not met: remove the variable from df.new and flag it in outcomelist
    df.new[[temp]] <- NULL
    outcomelist[ii] <- NA
  }
}
# drop the flagged outcomes
outcomelist <- outcomelist[!is.na(outcomelist)]
modelall <- lapply(outcomelist, function(x) {
  glm(substitute(i ~ type + sex, list(i = as.name(x))), family = "binomial", data = df.new)
})
# name the list after the outcome variables
names(modelall) <- outcomelist
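A variant of the same idea, in case it helps: the check can be pulled into a small helper and the outcomes filtered before fitting. This is only a sketch, not the one right way to do it. The names `toy`, `y_ok`, `y_bad`, `enough_females`, `kept` and `models` are made up for illustration, and the toy data is built deterministically so the result is predictable. By default aggregate() leaves out combinations that have no rows at all, so this version counts with table() on factors with explicit levels, which keeps empty cells as zeros.

```r
# Hypothetical toy data: 2 types x 2 sexes x 4 replicates = 16 rows
toy <- expand.grid(type = c("a", "b"), sex = c("F", "M"), rep = 1:4,
                   stringsAsFactors = FALSE)
toy$y_ok  <- as.integer(toy$rep %% 2 == 0)  # 2 females in every type/outcome cell
toy$y_bad <- as.integer(toy$rep == 1)       # only 1 female per type has y_bad == 1

# Hypothetical helper: TRUE if every type/outcome cell holds at least min_n females.
# Explicit factor levels make empty cells show up in the table as zeros.
enough_females <- function(data, outcome, min_n = 2) {
  f <- data[data$sex == "F", ]
  counts <- table(factor(f$type, levels = unique(data$type)),
                  factor(f[[outcome]], levels = c(0, 1)))
  all(counts >= min_n)
}

outcomes <- c("y_ok", "y_bad")
# keep only the outcomes that satisfy the requirement
kept <- Filter(function(y) enough_females(toy, y), outcomes)

# sapply() over a character vector names the result after its elements,
# so the models come back already labelled by outcome
models <- sapply(kept, function(y) {
  glm(reformulate(c("type", "sex"), response = y), family = binomial, data = toy)
}, simplify = FALSE)

names(models)  # "y_ok"
```

With the data from the question, the equivalent call would be along the lines of `Filter(function(y) enough_females(df, y), names(df)[5:8])`, and a single model can then be pulled out by name, for example `models[["outcome1"]]`, provided that outcome passed the check.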