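I am working with high-throughput microarray data (methylation arrays), and after running univariate, lasso, and cross-validated lasso analyses I arrived at a list of 15 probes (predictors).
Now I want to plot a ROC curve and compute the AUC to check whether those predictors are actually good candidates. The problem is that the result is a ROC curve with AUC = 1. I have tried tweaking the fitted model (i.e. family and maxit), but the result does not change.
Here is a sample of the data (with 8 of the predictors) and the downstream analysis, with some explanation: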
Tumor probe_1 probe_2 probe_3 probe_4 probe_5 probe_6 probe_7 probe_8
Benign.A4 No -5.076257 -3.18658187 -2.91627872 -3.2393655 -2.4080861 -3.9414602 -4.5844204 -2.96877633
Benign.A1 No -3.232952 -2.21518181 0.71340947 -2.1103999 -1.4563154 -4.0614544 -2.9378821 -0.90468942
Benign.C2 No -4.487701 -3.34515435 -5.35341349 -2.0355878 -2.9573763 -4.2980546 -4.3421487 -2.35597830
Benign.C8 No -3.692610 -1.24332686 -0.59115736 -3.4852858 -2.3339160 -3.1302782 -3.0943430 -1.03581249
Benign.D7 No -2.978757 -0.05097524 0.02744634 -1.4946543 -1.5593915 -2.8860660 -2.7633458 -0.99299595
Benign.D3 No -2.441925 -1.98227873 -2.13478645 -3.0265593 -2.7789079 -3.9860489 -2.8512663 -2.61804934
Tumor.A6 Yes 1.044348 -5.85637090 -4.49697162 1.5033139 0.3226736 1.5937440 -0.4881769 0.95135529
Tumor.A5 Yes 1.749187 -2.93393903 -5.54439148 2.4403760 1.6238294 -1.1699169 3.0410728 1.07437064
Tumor.A2 Yes 2.323806 -6.57693143 -5.78690184 1.7684931 2.3522317 0.3517146 -1.9972320 1.46663990
Tumor.C1 Yes 2.229316 -6.69010615 -6.22036584 0.7482678 1.3277280 0.6128029 1.3349142 1.63602050
Tumor.C6 Yes 2.888489 -5.79079519 -5.02991621 1.4605461 1.3002248 1.1498193 0.4481215 0.81473797
Tumor.C5 Yes -1.861726 -5.14400193 -5.26197761 1.0023323 0.8582683 0.5492184 0.6720438 1.73785369
Tumor.D1 Yes 2.776804 -6.78537165 -6.20280759 2.0623420 1.8291220 1.7328508 1.3667038 1.77813837
Tumor.D6 Yes 2.985209 -6.13405436 -5.92181030 1.8801728 1.1815045 2.2210693 0.1363381 2.21102559
Tumor.D8 Yes 1.670136 -6.72855542 -6.61156537 1.9847271 1.6267041 -2.8621148 0.7134887 -0.56794735
Tumor.A3 Yes 2.106628 -5.61286600 -5.75976883 2.1291475 0.5839721 1.4210874 1.2746626 1.77239233
Tumor.A8 Yes 1.798005 -5.53405698 -5.34042037 3.0262657 1.2199790 1.2448107 1.2297283 0.25649834
Tumor.A7 Yes 1.798074 -6.03775348 -5.01964376 1.2428083 2.3899569 0.6292222 0.6439477 0.92047002
Tumor.C3 Yes 1.542737 -6.54219832 -5.94287577 1.6111676 2.1889028 0.1228641 0.7950770 1.38000135
Tumor.C7 Yes 3.369420 -6.84809093 -5.88474727 2.7525838 3.2090893 1.1435739 1.2199450 0.89089956
Tumor.C4 Yes 3.179484 -6.59432541 -5.68920298 2.4093288 2.3173752 -0.3378846 1.3653768 0.66432101
Tumor.D5 Yes 2.328382 -6.41234621 -6.18003184 -0.1768171 2.1202506 2.4287615 1.7804487 0.08098025
Tumor.D4 Yes 3.051829 -7.01875245 -6.32614849 1.4200916 2.3582254 2.4981644 1.7878118 1.14826500
Tumor.D2 Yes 2.686846 -3.57625801 -6.25573666 1.6330575 0.8448418 1.4229245 -0.6461006 0.09491185
The glm analysis:
> glmcgs <- glm(Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 + probe_5 + probe_6 + probe_7 + probe_8 + probe_9 +
probe_10 + probe_11 + probe_12 + probe_13 + probe_14 + probe_15, data=cgshort, family = quasibinomial(link = 'logit'), maxit=100)
> summary(glmcgs)
Call:
glm(formula = Tumor ~ probe_1 + probe_2 + probe_3 + probe_4 +
probe_5 + probe_6 + probe_7 + probe_8 + probe_9 + probe_10 +
probe_11 + probe_12 + probe_13 + probe_14 + probe_15, family = quasibinomial(link = "logit"),
data = cgshort, maxit = 100)
Deviance Residuals:
Min 1Q Median 3Q Max
-6.227e-06 -3.066e-07 3.076e-06 4.536e-06 6.389e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.68423 5.23171 4.527 0.001932 **
probe_1 0.41584 1.03539 0.402 0.698471
probe_2 -0.88243 0.80631 -1.094 0.305630
probe_3 -1.14642 0.60525 -1.894 0.094819 .
probe_4 0.08650 1.64350 0.053 0.959314
probe_5 -1.46564 1.38381 -1.059 0.320469
probe_6 -0.72839 1.35910 -0.536 0.606580
probe_7 2.59539 0.48714 5.328 0.000704 ***
probe_8 2.03890 1.43339 1.422 0.192700
probe_9 0.87683 1.52469 0.575 0.581041
probe_10 1.79828 0.80940 2.222 0.057028 .
probe_11 0.66033 0.93300 0.708 0.499195
probe_12 -14.75184 2.98871 -4.936 0.001141 **
probe_13 3.30891 1.31239 2.521 0.035737 *
probe_14 0.36376 0.99582 0.365 0.724368
probe_15 -0.03516 0.91771 -0.038 0.970375
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 6.771977e-11)
Null deviance: 2.6992e+01 on 23 degrees of freedom
Residual deviance: 3.9885e-10 on 8 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 24
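PS: The reason for using quasibinomial here is that the "Tumor" samples come from 2 different stages. However, there is no statistically significant difference in methylation level between them (a previous analysis already checked this).
Finally, the ROC curve with its AUC: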
> roc.final <- roc(cgshort$Tumor, fitted(glmcgs), smooth=FALSE)
Call:
roc.default(response = cgshort$Tumor, predictor = fitted(glmcgs), smooth = FALSE)
Data: fitted(glmcgs) in 6 controls (cgshort$Tumor No) < 18 cases (cgshort$Tumor Yes).
Area under the curve: 1
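My guess is that the sample size is not large enough, which would also explain the high standard errors. Could that be the case? And is there a way to evaluate how good those potential predictors are in a sample like this?
Any help is very welcome. Thanks!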
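The AUC will be 1 whenever the model separates the successes from the failures perfectly. You have very little data and many predictors, some of which are very effective at predicting the outcome, so it is no surprise that the model shows certainty and the ROC curve is a square.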
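To make the separation concrete, here is a minimal base-R sketch using only the probe_4 column of the sample posted in the question. The helper auc_pairs is a hypothetical name introduced for illustration (it is not part of pROC); it computes the AUC as the fraction of tumor/benign pairs in which the tumor sample has the higher value.

```r
# probe_4 values copied from the data sample in the question
benign <- c(-3.2393655, -2.1103999, -2.0355878, -3.4852858, -1.4946543, -3.0265593)
tumor  <- c(1.5033139, 2.4403760, 1.7684931, 0.7482678, 1.4605461, 1.0023323,
            2.0623420, 1.8801728, 1.9847271, 2.1291475, 3.0262657, 1.2428083,
            1.6111676, 2.7525838, 2.4093288, -0.1768171, 1.4200916, 1.6330575)

# Hypothetical helper: AUC as P(case value > control value), ties counted as 1/2
auc_pairs <- function(cases, controls) {
  d <- outer(cases, controls, "-")   # all case-minus-control differences
  mean((d > 0) + 0.5 * (d == 0))
}

auc_probe4 <- auc_pairs(tumor, benign)
print(auc_probe4)
```

In this sample every tumor value of probe_4 lies above every benign value, so this single probe already gives an AUC of 1 on its own. A model with 15 predictors fitted to 24 samples can hardly do anything other than reproduce that.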
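Let me guess: you used the maxit argument because the model did not converge. With perfectly separated classes the likelihood keeps increasing as the coefficients grow, so the fit never settles and the solution it reports is not reliable (note that the residual deviance is essentially zero). To get a reliable one, you can use a generalized LASSO or some other kind of regularization.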
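As a rough illustration of why a penalty helps, the sketch below fits a ridge-penalized (L2) logistic regression on the probe_4 column alone, using only base R. This is an assumption-laden toy, not the analysis from the question: the function pen_nll is a hypothetical helper, the penalty weight lambda = 1 is an arbitrary untuned value, and for the real 15-probe problem a dedicated package such as glmnet, with lambda chosen by cross-validation, would be the more usual route.

```r
# probe_4 values copied from the data sample in the question
benign <- c(-3.2393655, -2.1103999, -2.0355878, -3.4852858, -1.4946543, -3.0265593)
tumor  <- c(1.5033139, 2.4403760, 1.7684931, 0.7482678, 1.4605461, 1.0023323,
            2.0623420, 1.8801728, 1.9847271, 2.1291475, 3.0262657, 1.2428083,
            1.6111676, 2.7525838, 2.4093288, -0.1768171, 1.4200916, 1.6330575)

x <- c(benign, tumor)
y <- c(rep(0, length(benign)), rep(1, length(tumor)))  # 0 = No, 1 = Yes
X <- cbind(1, x)                                       # intercept column + probe_4

# Hypothetical helper: logistic negative log-likelihood plus an L2 penalty
# on the slope only (the intercept is left unpenalized)
pen_nll <- function(beta, lambda) {
  eta <- drop(X %*% beta)
  sum(log1p(exp(eta)) - y * eta) + lambda * beta[2]^2
}

# lambda = 1 is an arbitrary illustrative value, not a tuned one
ridge_fit <- optim(c(0, 0), pen_nll, lambda = 1, method = "BFGS")
beta_hat  <- ridge_fit$par
p_hat     <- plogis(drop(X %*% beta_hat))  # fitted probabilities

print(beta_hat)
print(range(p_hat))
```

Because the penalty grows with the slope, the optimum sits at a finite coefficient even though the classes are perfectly separated, and the fitted probabilities stay strictly between 0 and 1. The ranking of the samples is unchanged, so the in-sample AUC is still 1. A penalty stabilizes the estimates, but judging the predictors still needs samples that were not used for selecting or fitting them.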
By the way, this is a statistics question.