Stepwise multiple linear regression in RStudio

Question

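My data has 6 independent variables and 5 dependent variables, and I am running a stepwise multiple linear regression for each dependent variable. My code drops all independent variables with a p-value > 0.05 after the first regression, reruns the regression, and repeats this in a loop until only independent variables with p-values < 0.05 remain.

The R code works fine for the first three dependent variables, so I know it can work, but it hangs just as it should print the final summary of the regression for dependent variable 4.

With this sample data, it only processes the first dependent variable and then stops, as shown here: (screenshot: RStudio console output stops)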

Sample data:

| IndVar1 | IndVar2 | IndVar3 | IndVar4 | IndVar5 | IndVar6 | DepVar1 | DepVar2 | DepVar3 | DepVar4 | DepVar5 |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| 114156 | 131937 | 10967 | 16260 | 9772 | 12643 | 1674605 | 12721 | 2.55 | 42820 | 2.70 |
| 93447 | 128287 | 9733 | 13994 | 8682 | 15848 | 1731694 | 12169 | 1.73 | 28394 | 2.59 |
| 113909 | 108688 | 9485 | 14936 | 6849 | 11522 | 1910052 | 9924 | 1.08 | 38820 | 1.93 |
| 103983 | 94270 | 10251 | 11645 | 10667 | 11114 | 1955383 | 4358 | 0.93 | 50756 | 1.48 |
| 111403 | 137057 | 9426 | 15831 | 9223 | 15946 | 1698939 | 9992 | 1.88 | 30729 | 2.03 |
| 93071 | 100127 | 6040 | 11916 | 7591 | 16181 | 1563494 | 5290 | 1.77 | 39528 | 1.81 |
| 115366 | 130482 | 7413 | 11227 | 10003 | 13360 | 1573640 | 11456 | 2.62 | 30783 | 1.63 |
| 98578 | 109484 | 6648 | 14199 | 7660 | 11343 | 1836676 | 10956 | 2.13 | 39551 | 1.53 |
| 98487 | 97535 | 9862 | 16051 | 6431 | 12682 | 1726679 | 7254 | 2.32 | 38429 | 2.70 |
| 105592 | 139141 | 6605 | 15330 | 8149 | 13412 | 1768668 | 4089 | 1.68 | 46858 | 1.87 |
| 102727 | 122752 | 9364 | 14205 | 9111 | 16175 | 1679431 | 12761 | 0.99 | 29727 | 2.33 |
| 108970 | 124523 | 8749 | 13756 | 10551 | 11187 | 1890720 | 10360 | 1.84 | 52171 | 2.15 |
| 112604 | 111948 | 8621 | 14175 | 8448 | 16494 | 1837590 | 4999 | 1.40 | 37714 | 2.09 |
| 103223 | 92846 | 10285 | 12530 | 10296 | 14926 | 1677332 | 5387 | 1.34 | 30164 | 2.35 |
| 95176 | 104659 | 9700 | 15572 | 6291 | 14569 | 1794722 | 8468 | 0.93 | 43782 | 2.48 |
| 100051 | 112592 | 8926 | 15048 | 10131 | 12954 | 1715671 | 7446 | 2.00 | 43410 | 1.54 |

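I have already run all of these stepwise regressions in Excel, a very painful manual process, and got good results for every dependent variable.

Here is the code: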

# Function to perform stepwise regression
perform_stepwise_regression <- function(dependent_variable, independent_variables, data) {
  cat("======================================================================\n")
  cat("Dependent Variable: ", dependent_variable, "\n")
  cat("======================================================================\n")
  
  # Fit initial model
  lm_model <- lm(as.formula(paste(dependent_variable, "~", paste(independent_variables, collapse = " + "))), data = df)
  
  while (TRUE) {
    # Check p-values
    p_values <- summary(lm_model)$coefficients[, 4]
    
    # Identify variables with the highest p-value > 0.05
    variable_to_remove <- names(p_values[p_values == max(p_values)])
    
    # Break the loop if no variable has a p-value > 0.05
    if (max(p_values) <= 0.05) {
      break
    }
    
    # Create a new formula excluding the variable
    predictors_to_include <- setdiff(names(lm_model$model)[-1], variable_to_remove)
    
    new_formula <- as.formula(paste(dependent_variable, "~", paste(predictors_to_include, collapse = " + ")))
    
    # Fit a new model
    lm_model <- lm(new_formula, data = df)
  }
  
  # Create the final formula using the single remaining independent variable
  final_formula <- as.formula(paste(dependent_variable, "~", predictors_to_include))
  
  # Fit the final model using the created formula
  final_model <- lm(final_formula, data = df)
  
  # Print the final summary in the console
  print(summary(lm_model))
}

# Example usage:
dependent_variables <- c("DepVar1", "DepVar2", "DepVar3", "DepVar4", "DepVar5")
independent_variables <- c("IndVar1", "IndVar2", "IndVar3", "IndVar4", "IndVar5", "IndVar6")

for (dependent_variable in dependent_variables) {
  perform_stepwise_regression(dependent_variable, independent_variables, df)
}

Google Gemini is worthless here and ChatGPT is toying with me. I need human intervention. Does anyone know what is going on?

Answer 1

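For the `DepVar2` regression, the least significant term in the initial model is the intercept. When you build the new formula, you try to remove `"(Intercept)"` from a vector that does not contain `"(Intercept)"`, which simply returns the same vector. The next iteration therefore uses exactly the same independent variables, and you are stuck in an infinite loop. It works for `DepVar1` because the intercept is significant there.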
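To see why the predictor list never shrinks, here is a small illustration of how `setdiff` behaves when the element to remove is absent. The three predictor names are just a shortened, illustrative subset of the ones in the question.

```r
# Illustrative subset of the predictor names from the question
predictors <- c("IndVar1", "IndVar2", "IndVar3")

# "(Intercept)" is not one of the predictors, so nothing is removed
remaining <- setdiff(predictors, "(Intercept)")

# The vector comes back unchanged, so the refitted model is identical
print(identical(remaining, predictors))  # TRUE
```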

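You can fix this by dropping the intercept from `p_values`, for example with `p_values[!(names(p_values)=='(Intercept)')]`. I am assuming you want to keep the intercept in the model even if it is not significant.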
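As a rough sketch of how that fix could be folded into the loop, the function below drops one predictor at a time while always keeping the intercept. The function name `backward_eliminate` and the `alpha` argument are hypothetical, not from the question; it also assumes all predictors are numeric (so coefficient names match column names) and uses the `data` argument rather than a global `df`.

```r
# Sketch only: backward elimination by p-value that never drops the intercept.
# `backward_eliminate` and `alpha` are hypothetical names for illustration.
backward_eliminate <- function(dependent_variable, independent_variables,
                               data, alpha = 0.05) {
  predictors <- independent_variables
  repeat {
    # Fall back to an intercept-only model if every predictor was dropped
    rhs <- if (length(predictors) > 0) paste(predictors, collapse = " + ") else "1"
    model <- lm(as.formula(paste(dependent_variable, "~", rhs)), data = data)
    if (length(predictors) == 0) break

    # p-values of the predictors only; the intercept is excluded on purpose
    p_values <- summary(model)$coefficients[, 4]
    p_values <- p_values[names(p_values) != "(Intercept)"]
    if (max(p_values) <= alpha) break

    # Drop the single least significant predictor and refit
    worst <- names(p_values)[which.max(p_values)]
    if (!(worst %in% predictors)) {
      # Guards against another endless loop, e.g. with factor predictors
      stop("Cannot match coefficient '", worst, "' to a predictor")
    }
    predictors <- setdiff(predictors, worst)
  }
  model
}
```

With the question's objects, this could be called as `print(summary(backward_eliminate(dependent_variable, independent_variables, df)))` inside the existing `for` loop. Because each pass removes exactly one predictor or stops, the loop cannot run more iterations than there are predictors.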
