Possible causes of AUC = 1 (from a fitted glm model)?

Question

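I am working with high-throughput microarray data (methylation arrays). After running univariate, lasso, and cross-validated lasso analyses, I ended up with a list of 15 probes (predictors).

Now I want to run an ROC / AUC analysis to check whether those predictors really are good candidates. The problem is that the result is an ROC curve with AUC = 1. I have tried tweaking the fitted model (i.e. family and maxit), but the result does not change.

Here is a sample of the data (showing 8 of the predictors) and the downstream analysis, with some explanation: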

          Tumor   probe_1     probe_2     probe_3    probe_4    probe_5    probe_6    probe_7     probe_8
Benign.A4    No -5.076257 -3.18658187 -2.91627872 -3.2393655 -2.4080861 -3.9414602 -4.5844204 -2.96877633
Benign.A1    No -3.232952 -2.21518181  0.71340947 -2.1103999 -1.4563154 -4.0614544 -2.9378821 -0.90468942
Benign.C2    No -4.487701 -3.34515435 -5.35341349 -2.0355878 -2.9573763 -4.2980546 -4.3421487 -2.35597830
Benign.C8    No -3.692610 -1.24332686 -0.59115736 -3.4852858 -2.3339160 -3.1302782 -3.0943430 -1.03581249
Benign.D7    No -2.978757 -0.05097524  0.02744634 -1.4946543 -1.5593915 -2.8860660 -2.7633458 -0.99299595
Benign.D3    No -2.441925 -1.98227873 -2.13478645 -3.0265593 -2.7789079 -3.9860489 -2.8512663 -2.61804934
Tumor.A6    Yes  1.044348 -5.85637090 -4.49697162  1.5033139  0.3226736  1.5937440 -0.4881769  0.95135529
Tumor.A5    Yes  1.749187 -2.93393903 -5.54439148  2.4403760  1.6238294 -1.1699169  3.0410728  1.07437064
Tumor.A2    Yes  2.323806 -6.57693143 -5.78690184  1.7684931  2.3522317  0.3517146 -1.9972320  1.46663990
Tumor.C1    Yes  2.229316 -6.69010615 -6.22036584  0.7482678  1.3277280  0.6128029  1.3349142  1.63602050
Tumor.C6    Yes  2.888489 -5.79079519 -5.02991621  1.4605461  1.3002248  1.1498193  0.4481215  0.81473797
Tumor.C5    Yes -1.861726 -5.14400193 -5.26197761  1.0023323  0.8582683  0.5492184  0.6720438  1.73785369
Tumor.D1    Yes  2.776804 -6.78537165 -6.20280759  2.0623420  1.8291220  1.7328508  1.3667038  1.77813837
Tumor.D6    Yes  2.985209 -6.13405436 -5.92181030  1.8801728  1.1815045  2.2210693  0.1363381  2.21102559
Tumor.D8    Yes  1.670136 -6.72855542 -6.61156537  1.9847271  1.6267041 -2.8621148  0.7134887 -0.56794735
Tumor.A3    Yes  2.106628 -5.61286600 -5.75976883  2.1291475  0.5839721  1.4210874  1.2746626  1.77239233
Tumor.A8    Yes  1.798005 -5.53405698 -5.34042037  3.0262657  1.2199790  1.2448107  1.2297283  0.25649834
Tumor.A7    Yes  1.798074 -6.03775348 -5.01964376  1.2428083  2.3899569  0.6292222  0.6439477  0.92047002
Tumor.C3    Yes  1.542737 -6.54219832 -5.94287577  1.6111676  2.1889028  0.1228641  0.7950770  1.38000135
Tumor.C7    Yes  3.369420 -6.84809093 -5.88474727  2.7525838  3.2090893  1.1435739  1.2199450  0.89089956
Tumor.C4    Yes  3.179484 -6.59432541 -5.68920298  2.4093288  2.3173752 -0.3378846  1.3653768  0.66432101
Tumor.D5    Yes  2.328382 -6.41234621 -6.18003184 -0.1768171  2.1202506  2.4287615  1.7804487  0.08098025
Tumor.D4    Yes  3.051829 -7.01875245 -6.32614849  1.4200916  2.3582254  2.4981644  1.7878118  1.14826500
Tumor.D2    Yes  2.686846 -3.57625801 -6.25573666  1.6330575  0.8448418  1.4229245 -0.6461006  0.09491185

The glm analysis:

> glmcgs <- glm(Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 + probe_5 + probe_6 + probe_7 + probe_8 + probe_9 + 
                  probe_10 + probe_11 + probe_12 + probe_13 + probe_14 + probe_15, data=cgshort, family = quasibinomial(link = 'logit'), maxit=100)

> summary(glmcgs)

Call:
glm(formula = Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 + 
    probe_5 + probe_6 + probe_7 + probe_8 + probe_9 + probe_10 + 
    probe_11 + probe_12 + probe_13 + probe_14 + probe_15, family = quasibinomial(link = "logit"), 
    data = cgshort, maxit = 100)

Deviance Residuals: 
       Min          1Q      Median          3Q         Max  
-6.227e-06  -3.066e-07   3.076e-06   4.536e-06   6.389e-06  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  23.68423    5.23171   4.527 0.001932 ** 
probe_1       0.41584    1.03539   0.402 0.698471    
probe_2      -0.88243    0.80631  -1.094 0.305630    
probe_3      -1.14642    0.60525  -1.894 0.094819 .  
probe_4       0.08650    1.64350   0.053 0.959314    
probe_5      -1.46564    1.38381  -1.059 0.320469    
probe_6      -0.72839    1.35910  -0.536 0.606580    
probe_7       2.59539    0.48714   5.328 0.000704 ***
probe_8       2.03890    1.43339   1.422 0.192700    
probe_9       0.87683    1.52469   0.575 0.581041    
probe_10      1.79828    0.80940   2.222 0.057028 .  
probe_11      0.66033    0.93300   0.708 0.499195    
probe_12    -14.75184    2.98871  -4.936 0.001141 ** 
probe_13      3.30891    1.31239   2.521 0.035737 *  
probe_14      0.36376    0.99582   0.365 0.724368    
probe_15     -0.03516    0.91771  -0.038 0.970375    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 6.771977e-11)

    Null deviance: 2.6992e+01  on 23  degrees of freedom
Residual deviance: 3.9885e-10  on  8  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 24

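PS: I used quasibinomial here because the "Tumor" samples include 2 different stages. However, there is no statistically significant difference in methylation levels between them (this was checked in an earlier analysis).

And finally the ROC curve with its AUC: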

> roc.final <- roc(cgshort$Tumor, fitted(glmcgs), smooth=FALSE)

Call:
roc.default(response = cgshort$Tumor, predictor = fitted(glmcgs),     smooth = FALSE)

Data: fitted(glmcgs) in 6 controls (cgshort$Tumor No) < 18 cases (cgshort$Tumor Yes).
Area under the curve: 1

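My guess is that the sample size is not large enough, which would also explain the high standard errors. Could that be the reason? And is there a way to assess how well these potential predictors perform in a sample like this?

Any help is very welcome. Thanks!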

Answer 1

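The AUC will be 1 if the model perfectly separates the successes from the failures. You have very little data and many predictors, some of which predict the outcome very well, so it is not surprising that this model shows complete certainty and that the ROC curve is a square.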
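To illustrate how easily this happens with few samples and many predictors, here is a minimal sketch in base R. Everything in it is made up for illustration: the 15 predictors are pure random noise, only the group sizes (6 controls, 18 cases) are borrowed from the question, and `auc_rank` is a hypothetical helper that computes the AUC from ranks (the Mann-Whitney form) so that no extra package is needed.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Rank-based (Mann-Whitney) AUC. `label` is logical, TRUE marks a case.
auc_rank <- function(score, label) {
  r  <- rank(score)              # mid-ranks, so ties are handled
  n1 <- sum(label)
  n0 <- sum(!label)
  (sum(r[label]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

n <- 24
p <- 15
is_case <- rep(c(FALSE, TRUE), times = c(6, 18))  # 6 controls, 18 cases

# Predictors are pure noise: they carry no information about the outcome.
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("noise_", seq_len(p))))
dat <- data.frame(y = as.integer(is_case), X)

# 16 parameters for 24 observations. glm usually warns here that fitted
# probabilities of 0 or 1 occurred, which is the signature of separation.
fit <- suppressWarnings(
  glm(y ~ ., data = dat, family = binomial,
      control = glm.control(maxit = 100))
)

# In-sample AUC: scored on the same rows the model was fitted to.
auc_in <- auc_rank(fitted(fit), is_case)
print(auc_in)
```

With this many parameters relative to the sample size, the in-sample AUC of the noise model will usually come out at or very close to 1, which is why an AUC computed from `fitted()` says little about whether the probes are good candidates.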

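Let me guess: you used the maxit argument because the model failed to converge. That means the solution is not reliable. To get a reliable one, you could use a generalized LASSO or some other kind of regularization.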
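As a rough sketch of what "some other kind of regularization" plus an honest evaluation could look like, the base R code below fits a ridge-penalised logistic regression and scores every sample with a model that never saw it (leave-one-out). Several things here are assumptions and not part of the question: ridge is swapped in for the lasso only because it is easy to write without extra packages (for an actual lasso, `cv.glmnet` from the glmnet package would be the usual tool), `ridge_logit` and `auc_rank` are hypothetical helpers, the penalty `lambda = 1` is arbitrary, and the data are simulated noise with the question's group sizes.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Rank-based (Mann-Whitney) AUC. `label` is logical, TRUE marks a case.
auc_rank <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label)
  n0 <- sum(!label)
  (sum(r[label]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Ridge-penalised logistic regression by direct optimisation.
# The intercept (first coefficient) is not penalised.
ridge_logit <- function(X, y, lambda) {
  Xd <- cbind(1, X)
  obj <- function(b) {
    eta <- drop(Xd %*% b)
    sum(log1p(exp(eta)) - y * eta) + lambda * sum(b[-1]^2)
  }
  grad <- function(b) {
    eta <- drop(Xd %*% b)
    drop(crossprod(Xd, plogis(eta) - y)) + 2 * lambda * c(0, b[-1])
  }
  optim(rep(0, ncol(Xd)), obj, grad, method = "BFGS")$par
}

n <- 24
p <- 15
y <- rep(c(0, 1), times = c(6, 18))   # 6 controls, 18 cases
X <- matrix(rnorm(n * p), n, p)       # noise predictors, already on a common scale

# Leave-one-out: refit without sample i, then score sample i on the link scale.
loo_score <- numeric(n)
for (i in seq_len(n)) {
  b <- ridge_logit(X[-i, , drop = FALSE], y[-i], lambda = 1)
  loo_score[i] <- sum(c(1, X[i, ]) * b)
}

auc_loo <- auc_rank(loo_score, y == 1)
print(auc_loo)
```

The penalty keeps the coefficients finite even when the classes are separable, so no `maxit` workaround is needed. To try this on real data, `X` would be the (scaled) probe matrix and `y` the 0/1 tumor indicator. Note that the probe selection itself would also have to be repeated inside each fold, otherwise the held-out samples have already influenced the model. Pooled leave-one-out scores give only a rough AUC estimate with so few samples.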

By the way, this is a statistics question.
