线性回归函数向量化

Question

我想编写一个脚本来在整个 data.table 中运行回归模型，其中我的函数适合模型并提取信息以供以后分析。我有大量模型需要拟合，所以我想加快速度。为了做到这一点，我想向量化这个函数，但是由于数据中散布着 NA 值，并且我不想估算缺失值，所以我一直在努力，只需将每个模型与可用数据进行拟合即可。我还想从模型中提取一些统计数据。所以我的问题是，这个函数可以向量化吗？如果可以的话如何向量化？

总而言之，目的是使用“协变量”列作为协变量（最后一列）来拟合“dependent_var”（第一列）与每个预测变量列（单独）的简单线性回归模型。因此，对于每个模型，我想要以下内容：

return(indepent_var = col, 
       pearson = cor_coef, 
       pearson_p = p_value, 
       indepent_var_beta = slope, 
       indepent_var_beta_sig = slope_sig,
       adj_r_squared = adj_r_squared, 
       indepent_var_beta_se = se_slope, 
       intercept_se = se_intercept, 
       n_obs = nrow(dfx_subset)) # dfx subset being the subset used to fit the model in question.

玩具数据

作为示例数据集

library(data.table)

set.seed(123)
num_rows <- 300
num_cols <- 6000
# Random data
data <- matrix(rnorm(num_rows * num_cols), nrow = num_rows)
# Many columns include NAs which need to be handled during fit without imputation
prop_na <- runif(num_cols, min = 0.1, max = 0.5)
for (i in 1:num_cols) {
  num_na <- round(prop_na[i] * num_rows)
  idx_na <- sample(1:num_rows, num_na)
  data[idx_na, i] <- NA
}
# Mock table, dependent variable and binary covariate variable
DT <- as.data.table(cbind(data, covariate = sample(0:1, num_rows, replace = TRUE)))
setnames(DT, old = c("V1"), new = c("dependent_var"))

我将来可能会包含更多协变量，因此希望包含皮尔逊系数以进行一些快速检查/潜在的未来分析。由于我有大量数据需要处理，因此最好在 HPC 上提交一次。到目前为止，这是我的功能：

线性回归函数

lm_analysis <- function(col, dfx) {
  tryCatch({
    dfx_subset <- dfx[complete.cases(dfx[[1]], dfx[[col]], dfx[["covariate"]]), ]
    ## Compute Pearson correlation
    cor_test <- cor.test(dfx_subset[[1]], dfx_subset[[col]])
    cor_coef <- cor_test$estimate
    p_value <- cor_test$p.value
    ## Linear regression
    lm_result <- lm(dfx_subset[[1]] ~ dfx_subset[[col]] + dfx_subset[["covariate"]], data = dfx_subset)
    slope <- lm_result$coefficients[2]
    model_summary <- summary(lm_result)
    slope_sig <- model_summary$coefficients[2, 4]
    adj_r_squared <- model_summary$adj.r.squared
    se_slope <- model_summary$coefficients[2, "Std. Error"]
    se_intercept <- model_summary$coefficients[1, "Std. Error"]
    
    return(c(indepent_var = col, 
             pearson = cor_coef, 
             pearson_p = p_value, 
             indepent_var_beta = slope, 
             indepent_var_beta_sig = slope_sig,
             adj_r_squared = adj_r_squared, 
             indepent_var_beta_se = se_slope, 
             intercept_se = se_intercept, 
             n_obs = nrow(dfx_subset)))
  }, error = function(e){
    return(NULL)
  })
}

功能应用：

predictor_cols <- setdiff(names(DT), c("dependent_var", "covariate"))

lm_results <- lapply(predictor_cols, lm_analysis, dfx = DT)

results_lmfit <- as.data.table(do.call(rbind, lm_results))

我不想在分析之前过滤任何观察结果。另外，理想情况下，我不想并行化此组件，因为我计划在外循环上执行此操作，我将在外循环上对因变量的 data.table 执行上述操作。

我在如何向量化方面处于一个松散的状态，并且由于不同数据集之间的数据不匹配，我不确定这是否可以轻松实现。任何建议将不胜感激。

Answer 1

您可以融合数据集并应用一个稍微简单的函数，其中返回的对象是一个列表：

功能稍微简单一点

lm_analysis_simple <- function(d,i,c) {
  
    cor_test <- cor.test(d,i)
    ## Linear regression
    lm_result <- lm(d ~ i+c, model=FALSE)
    model_summary = summary(lm_result)

    return(list(
             pearson = cor_test$estimate, 
             pearson_p = cor_test$p.value, 
             indepent_var_beta = lm_result$coefficients[2], 
             indepent_var_beta_sig = model_summary$coefficients[2, 4],
             adj_r_squared = model_summary$adj.r.squared, 
             indepent_var_beta_se = model_summary$coefficients[2, "Std. Error"], 
             intercept_se = model_summary$coefficients[1, "Std. Error"],
             n_obs = length(i)
    ))
}

融化数据集

DT_long = melt(DT,id.vars = c("dependent_var", "covariate"),variable.name = "independ_var")

通过每个自变量应用函数

na.omit(DT_long)[, lm_analysis_simple(dependent_var,value,covariate),independ_var]

输出：

      independ_var     pearson  pearson_p indepent_var_beta indepent_var_beta_sig adj_r_squared indepent_var_beta_se intercept_se n_obs
            <fctr>       <num>      <num>             <num>                 <num>         <num>                <num>        <num> <int>
   1:           V2  0.01870141 0.83733080       0.006931742            0.94264543  -0.006405378           0.09614583    0.1301829   123
   2:           V3 -0.04812460 0.55734413      -0.044441539            0.54942591  -0.002966220           0.07406963    0.1113687   151
   3:           V4  0.08896426 0.27255406       0.094223500            0.23947211  -0.001749838           0.07978495    0.1087159   154
   4:           V5 -0.14207020 0.07881438      -0.152165312            0.07242134   0.010553751           0.08411006    0.1092392   154
   5:           V6 -0.01676149 0.85888333      -0.008136059            0.92309369   0.004776748           0.08408907    0.1087452   115
  ---                                                                                                                                  
5995:        V5996 -0.15043800 0.10393970      -0.133171088            0.10896949   0.005849612           0.08243966    0.1128392   118
5996:        V5997  0.01159498 0.89068030       0.009851107            0.90131503  -0.011246345           0.07930029    0.1115778   143
5997:        V5998 -0.13492708 0.16384292      -0.133010591            0.13078530   0.010763237           0.08733929    0.1177022   108
5998:        V5999 -0.10859013 0.17441982      -0.113119773            0.17429042   0.003738834           0.08288311    0.1106717   158
5999:        V6000  0.07089022 0.47237635       0.059207306            0.48172533  -0.013942846           0.08384918    0.1323993   105

线性回归函数向量化

问题描述投票：0回答：1

玩具数据

线性回归函数

功能应用：

1个回答

功能稍微简单一点

融化数据集

通过每个自变量应用函数

最新问题

线性回归函数向量化

问题描述 投票：0回答：1

玩具数据

线性回归函数

功能应用：

1个回答

功能稍微简单一点

融化数据集

通过每个自变量应用函数

最新问题

问题描述投票：0回答：1