Computing the 95% CI of a cross-validated AUC (Python, sklearn)

Question

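I am looking for the correct way to compute the 95% CI of the AUC from my 5-fold CV.

My training dataset has n = 81 samples.

So with 5-fold CV, the test set in each fold contains about n = 16 samples on average.

Here is my Python code.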

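import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

folds = 5
seed = 42

# Grid Search
fit_intercept = [True, False]
C = np.arange(1, 41, 1)  # candidate values 1..40 for the inverse regularization strength
penalty = ['l1', 'l2']
params = dict(C=C, fit_intercept=fit_intercept, penalty=penalty)

# liblinear supports both the l1 and the l2 penalty
logreg = LogisticRegression(solver='liblinear', random_state=seed)

logreg_grid = GridSearchCV(logreg, param_grid=params, cv=folds, scoring='roc_auc')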

# fit the grid with data
logreg_grid.fit(X_train, y_train)

# fit best estimator
logreg = logreg_grid.best_estimator_

# Calculate AUC in 5-fold Stratified CV
logreg_scores = cross_val_score(logreg, X_train, y_train, cv=folds, scoring='roc_auc')
print('LogReg:',logreg_scores.mean())

# LogReg Scores: [0.95714286, 0.85, 0.98333333, 0.85, 0.56666667]  
# Mean: 0.8414285714285714

#AUC from LogReg = 0.8414

#Three ways I have tried to calculate the 95 % CI:

#LogReg Scores: [0.95714286, 0.85, 0.98333333, 0.85, 0.56666667]  
# Mean: 0.8414285714285714


                    ### First try ###
import statsmodels.stats.api as sms
conf = sms.DescrStatsW(logreg_scores).tconfint_mean(.05)
print(conf)

#Out: Lower 0.636, Upper: 1.047

                    ### Second Try ###
import scipy.stats
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2, n-1)
    return m, m-h, m+h


mean_confidence_interval(logreg_scores, confidence=0.95)

#Out: Mid: 0.84, Lower: 0.64, Upper: 1.05

                      ### Third ###
# interval = t * np.sqrt( (AUC * (1 - AUC)) / n)
# n = 16 (validation set), because the mean test-set size over all 5 folds is about 16 of my n = 81
# t = 2.120 (Source: https://www.sjsu.edu/faculty/gerstman/StatPrimer/t-table.pdf)

interval = 2.120 * np.sqrt( (0.8414285714285714 * (1 - 0.8414285714285714)) / 16)
print((.84 + interval)*100)
print(.84)
print((.84 - interval)*100)
print(interval)

# Output: Lower: 64.64 , Mid: 0.84, Upper: 103.36 , Interval: 0.194

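My question: all three results look about the same, but I must be doing something wrong, because I don't understand how an AUC can be > 1.0?

Thanks for any pointers, I look forward to your answers.

Cheers, Mischa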

Answer 1

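I'm not sure this will solve your problem, but my guess is that it happens because you are applying a t-test to a very small sample (n=5). A large variance is to be expected, which is why mean + SD > 1 in your case. Note that all three of your approaches are based on the t-test.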
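To make this concrete, the interval from the question can be reproduced by hand. The sketch below uses only the five fold scores posted in the question; the critical value 2.776 (two-sided 95%, 4 degrees of freedom) is hardcoded from a t-table as an assumption rather than computed. Because a t-interval is symmetric around the mean and knows nothing about the [0, 1] range of an AUC, a large standard error from n = 5 can push the upper limit past 1.

```python
import math
import statistics

# Fold AUCs as reported in the question.
scores = [0.95714286, 0.85, 0.98333333, 0.85, 0.56666667]

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)   # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(n)          # standard error of the mean

# Assumption: two-sided 95% critical value of Student's t with
# n - 1 = 4 degrees of freedom, taken from a t-table.
t_crit = 2.776

lower = mean - t_crit * se
upper = mean + t_crit * se
print(f"mean={mean:.3f}  sd={sd:.3f}  se={se:.3f}")
print(f"95% t-interval: [{lower:.3f}, {upper:.3f}]")
```

With these inputs the upper limit comes out slightly above 1, in line with the first two attempts in the question.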

To get enough scores to compare, you may want to try 1) repeated CV with different splits or 2) bootstrapping. Some useful discussion of CV: CV
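Option 2 could look roughly like the sketch below: a percentile bootstrap over out-of-fold predicted probabilities. This is one common variant, not necessarily what the answer had in mind. The data come from `make_classification` as a stand-in for `X_train` and `y_train`, and the plain `LogisticRegression` and `n_boot = 2000` are illustrative assumptions. The interval only reflects resampling of the evaluated samples, not the variability of refitting the model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Assumption: synthetic stand-in for X_train / y_train (n = 81 as in the question).
X, y = make_classification(n_samples=81, n_features=5, random_state=42)

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One out-of-fold probability of the positive class per sample.
proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

rng = np.random.default_rng(42)
n_boot = 2000
boot_aucs = []
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))  # resample with replacement
    if len(np.unique(y[idx])) < 2:
        continue                                # AUC is undefined with one class
    boot_aucs.append(roc_auc_score(y[idx], proba[idx]))

# Percentile interval: stays inside [0, 1] by construction.
lower, upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f"pooled AUC={roc_auc_score(y, proba):.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

Unlike the t-interval, the percentile limits are taken from actual AUC values, so they cannot exceed 1.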

Answer 2

This is a very helpful answer, 田林河! Thank you.

Here is how I implemented it.

from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits = 5, n_repeats = 100, random_state = seed)

logreg_scores = cross_val_score(logreg, X_train, y_train, cv=cv, scoring='roc_auc')
print('LogReg:',logreg_scores.mean())


import scipy.stats
def mean_confidence_interval(data, confidence=0.95):
    a = 1.0 * np.array(data)
    n = len(a)
    m, se = np.mean(a), scipy.stats.sem(a)
    h = se * scipy.stats.t.ppf((1 + confidence) / 2, n-1)
    return m, m-h, m+h

mean_confidence_interval(logreg_scores, confidence=0.95)

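The output looks good, because I now have 500 AUCs. >>> (0.8014285714285716, 0.7921705464185262, 0.810686596438617)

But how do I do the same for the predicted probabilities?

y_pred = cross_val_predict(logreg, X_train, y_train, cv=cv, method='predict_proba')

If I use the code above, it throws an error: "cross_val_predict only works for partitions"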
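The error arises because `cross_val_predict` expects every sample to land in exactly one test set, whereas `RepeatedStratifiedKFold` places each sample in one test set per repeat. One possible workaround, sketched below, is to loop over the repeats yourself with a differently seeded `StratifiedKFold` each time and collect one column of probabilities per repeat. The synthetic data, the plain `LogisticRegression` and `n_repeats = 20` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Assumption: synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=81, n_features=5, random_state=42)
model = LogisticRegression(max_iter=1000)

n_repeats = 20
proba = np.empty((n_repeats, len(y)))
for r in range(n_repeats):
    # Each repeat is a proper partition, so cross_val_predict accepts it.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=r)
    proba[r] = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]

mean_proba = proba.mean(axis=0)                     # averaged probability per sample
repeat_aucs = [roc_auc_score(y, p) for p in proba]  # one pooled AUC per repeat
print(f"mean AUC over repeats: {np.mean(repeat_aucs):.3f}")
```

Each row of `proba` holds the out-of-fold probabilities from one repeat, so they can be averaged per sample or scored repeat by repeat.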
