ROC curve and cutoff point in Python

Question  Votes: 28  Answers: 4

I fitted a logistic regression model and made predictions of the logit values. I used this to get the points on the ROC curve:

 from sklearn import metrics
 fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

I know that metrics.roc_auc_score gives the area under the ROC curve. Can anyone tell me which command will find the optimal cutoff point (threshold value)?

python logistic-regression roc
4 Answers

35 votes

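This answer is late, but I thought it might still be helpful. You can do this using the Epi package in R; however, I could not find a similar package or example in Python.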

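The optimal cutoff point is where the true positive rate is high and the false positive rate is low. Based on this logic, I have put together an example below that finds the optimal threshold.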

Python code:

import pandas as pd
import statsmodels.api as sm
import pylab as pl
import numpy as np
from sklearn.metrics import roc_curve, auc

# read the data in
df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")

# rename the 'rank' column because there is also a DataFrame method called 'rank'
df.columns = ["admit", "gre", "gpa", "prestige"]
# dummify rank
dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige')
# create a clean data frame for the regression
cols_to_keep = ['admit', 'gre', 'gpa']
# .ix was removed from pandas; .loc does the same label-based slice
data = df[cols_to_keep].join(dummy_ranks.loc[:, 'prestige_2':])

# manually add the intercept
data['intercept'] = 1.0

train_cols = data.columns[1:]
# fit the model
result = sm.Logit(data['admit'], data[train_cols]).fit()
print(result.summary())

# Add prediction to dataframe
data['pred'] = result.predict(data[train_cols])

fpr, tpr, thresholds =roc_curve(data['admit'], data['pred'])
roc_auc = auc(fpr, tpr)
print("Area under the ROC curve : %f" % roc_auc)

####################################
# The optimal cutoff is where tpr is high and fpr is low.
# Here that is taken as the point where tpr - (1-fpr) is zero or closest to zero.
####################################
i = np.arange(len(tpr)) # index for df
roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),
                    'tpr' : pd.Series(tpr, index=i),
                    '1-fpr' : pd.Series(1-fpr, index=i),
                    'tf' : pd.Series(tpr - (1-fpr), index=i),
                    'thresholds' : pd.Series(thresholds, index=i)})
# .ix was removed from pandas; select the row whose |tf| is smallest by label
print(roc.loc[[roc.tf.abs().idxmin()]])

# Plot tpr vs 1-fpr
fig, ax = pl.subplots()
pl.plot(roc['tpr'])
pl.plot(roc['1-fpr'], color = 'red')
pl.xlabel('1-False Positive Rate')
pl.ylabel('True Positive Rate')
pl.title('Receiver operating characteristic')
ax.set_xticklabels([])

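The optimal cutoff point is 0.317628, so anything above this value can be labeled 1 and anything else 0. You can see from the output and the chart that where tpr crosses 1-fpr, tpr is about 63% and fpr is about 36%, and tpr-(1-fpr) is closest to zero in this example.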

Output:

        1-fpr       fpr        tf     thresholds       tpr
  171  0.637363  0.362637  0.000433    0.317628     0.637795

[Plot: tpr and 1-fpr (red) against the threshold index, titled "Receiver operating characteristic"]

Hope this is helpful.

Edit

To simplify this and make it reusable, I have written a function that finds the optimal probability cutoff point.

Python Code:

def Find_Optimal_Cutoff(target, predicted):
    """ Find the optimal probability cutoff point for a classification model related to event rate
    Parameters
    ----------
    target : Matrix with dependent or target data, where rows are observations

    predicted : Matrix with predicted data, where rows are observations

    Returns
    -------     
    list type, with optimal cutoff value

    """
    fpr, tpr, threshold = roc_curve(target, predicted)
    i = np.arange(len(tpr)) 
    roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)})
    # .ix was removed from pandas; pick the row whose |tf| is smallest
    roc_t = roc.loc[[roc.tf.abs().idxmin()]]

    return list(roc_t['threshold']) 


# Add prediction probability to dataframe
data['pred_proba'] = result.predict(data[train_cols])

# Find optimal probability threshold
threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba'])
print(threshold)
# [0.31762762459360921]

# Find prediction to the dataframe applying threshold
# Find_Optimal_Cutoff returns a one-element list, so compare against its first item
data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold[0] else 0)

# Print confusion Matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(data['admit'], data['pred'])
# array([[175,  98],
#        [ 46,  81]])
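
The same criterion can also be written without the intermediate DataFrame. The following is only a minimal sketch: `find_balanced_cutoff` is a hypothetical helper name, and the small arrays are made-up values shaped like the output of `roc_curve` (thresholds decreasing), not results from the admissions data above.

```python
import numpy as np

def find_balanced_cutoff(fpr, tpr, thresholds):
    """Return the threshold where tpr is closest to 1 - fpr.

    Hypothetical helper: expects the three arrays returned by roc_curve.
    """
    gap = np.abs(tpr - (1 - fpr))      # |sensitivity - specificity|
    return thresholds[np.argmin(gap)]  # threshold with the smallest gap

# Illustrative ROC points (assumed values, not real model output)
fpr = np.array([0.0, 0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.4, 0.8, 0.9, 1.0, 1.0])
thresholds = np.array([np.inf, 0.9, 0.7, 0.5, 0.3, 0.1])

balanced_cutoff = find_balanced_cutoff(fpr, tpr, thresholds)
print(balanced_cutoff)
```

Returning a scalar instead of a list avoids having to index the result before comparing predictions against it.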

14 votes

Given tpr, fpr, and thresholds from your question, the answer for the optimal threshold is simply:

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
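
To see what these two lines select, here is a minimal sketch on a small hand-made ROC curve. The arrays are assumed, illustrative values laid out the way `roc_curve` returns them; they are not taken from the question's model.

```python
import numpy as np

# Illustrative ROC points (assumed values): thresholds decrease,
# fpr and tpr never decrease.
fpr = np.array([0.0, 0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.4, 0.8, 0.9, 1.0, 1.0])
thresholds = np.array([np.inf, 0.9, 0.7, 0.5, 0.3, 0.1])

# tpr - fpr equals sensitivity + specificity - 1 (Youden's J statistic)
j = tpr - fpr
optimal_idx = np.argmax(j)
optimal_threshold = thresholds[optimal_idx]

print(optimal_idx, optimal_threshold)
```

The difference is zero at both ends of the curve, (0, 0) and (1, 1), so the maximum falls at an interior point whenever the classifier does better than chance.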

8 votes

Vanilla Python implementation of Youden's J-Score

def cutoff_youdens_j(fpr,tpr,thresholds):
    j_scores = tpr-fpr
    j_ordered = sorted(zip(j_scores,thresholds))
    return j_ordered[-1][1]
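
Note that `tpr-fpr` in the function above relies on NumPy arrays, which is what `roc_curve` returns. If the inputs are plain Python lists, a variant along the following lines might be used instead. This is a sketch only: `cutoff_youdens_j_lists` is a hypothetical name and the sample values are made up.

```python
def cutoff_youdens_j_lists(fpr, tpr, thresholds):
    """Hypothetical list-friendly variant: threshold with the largest tpr - fpr."""
    j_scores = [t - f for f, t in zip(fpr, tpr)]
    # Index of the largest J score (the first one if there are ties)
    best = max(range(len(j_scores)), key=j_scores.__getitem__)
    return thresholds[best]

# Illustrative ROC points as plain lists (assumed values)
fpr = [0.0, 0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.4, 0.8, 0.9, 1.0, 1.0]
thresholds = [float("inf"), 0.9, 0.7, 0.5, 0.3, 0.1]

cutoff = cutoff_youdens_j_lists(fpr, tpr, thresholds)
print(cutoff)
```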

3 votes

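The post by cgnorthcutt

Given tpr, fpr, and thresholds from your question, the answer for the optimal threshold is simply:

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]

is almost correct. The absolute value must be taken.

optimal_idx = np.argmin(np.abs(tpr - fpr))  # Edit: Change to argmin!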
optimal_threshold = thresholds[optimal_idx]

According to the reference mentioned, http://www.medicalbiostatistics.com/roccurve.pdf (p. 6), I found another possibility:

opt_idx = np.argmin(np.sqrt(np.square(1-tpr)+ np.square(fpr)))
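
This expression picks the ROC point with the smallest Euclidean distance to the top-left corner (fpr = 0, tpr = 1). Below is a minimal sketch on made-up ROC points; the arrays are assumptions for illustration, not output from the question's model.

```python
import numpy as np

# Illustrative ROC points (assumed values), ordered as roc_curve returns them
fpr = np.array([0.0, 0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.4, 0.8, 0.9, 1.0, 1.0])
thresholds = np.array([np.inf, 0.9, 0.7, 0.5, 0.3, 0.1])

# Distance of every ROC point to the ideal corner (fpr=0, tpr=1)
distance = np.sqrt(np.square(1 - tpr) + np.square(fpr))
opt_idx = np.argmin(distance)
opt_threshold = thresholds[opt_idx]

print(opt_idx, opt_threshold)
```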
