roc_auc_score differs between a RandomForestClassifier tuned with GridSearchCV and an explicitly coded RandomForestClassifier

Question

Why doesn't a RandomForestClassifier trained with specific parameters match the performance reported by a GridSearchCV that varies those same parameters?

def random_forest(X_train, y_train):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import roc_auc_score, make_scorer
    from sklearn.model_selection import train_test_split
    
    X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, random_state=0)
    
    # various combinations of max depth and max features
    max_depth_vals = [1,2,3]
    max_features_vals = [2,3,4]
    grid_values = {'max_depth': max_depth_vals, 'max_features': max_features_vals}
    
    # build GridSearch
    clf = RandomForestClassifier(n_estimators=10)
    grid = GridSearchCV(clf, param_grid=grid_values, cv=3, scoring='roc_auc')
    grid.fit(X_train, y_train)
    y_hat_proba = grid.predict_proba(X_validate)
    print('Train Grid best parameter (max. AUC): ', grid.best_params_)
    print('Train Grid best score (AUC): ', grid.best_score_)
    print('Validation set AUC: ', roc_auc_score(y_validate, y_hat_proba[:,1]))

    
    # build RandomForest with hard coded values. AUC should be ballpark to grid search
    clf = RandomForestClassifier(max_depth=3, max_features=4, n_estimators=10)
    clf.fit(X_train, y_train)
    y_hat = clf.predict(X_validate)
    y_hat_prob = clf.predict_proba(X_validate)[:, 1]
    
    auc = roc_auc_score(y_hat, y_hat_prob)
    
    print("\nMax Depth: 3 Max Features: 4\n---------------------------------------------")
    print("auc: {}".format(auc))
    return

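Results: the grid search identifies max_depth=3 and max_features=4 as the best parameters and reports a roc_auc_score of 0.85. When I score the held-out validation set with it, I get a roc_auc_score of 0.84. However, when I hard-code those parameters into the classifier directly, it reports a roc_auc_score of 1.0. My understanding is that it should land in the same ballpark, around 0.85, but this is far off.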

Validation set AUC:  0.8490471073563559
Grid best parameter (max. AUC):  {'max_depth': 3, 'max_features': 4}
Grid best score (AUC):  0.8599727094965482

Max Depth: 3 Max Features: 4
---------------------------------------------
auc: 1.0

I may be misunderstanding a concept, applying the technique incorrectly, or simply have a coding bug. Thanks.

Answer 1

There are two issues:

Variability

To get reproducible results, specify a seed or random state wherever possible, for example:

RandomForestClassifier(n_estimators=10, random_state=1234)

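# random_state only takes effect when shuffle=True; recent scikit-learn versions raise an error otherwise
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1234)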
GridSearchCV(clf, param_grid=grid_values, cv=cv, scoring='roc_auc')
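As a rough check of the reproducibility point, the sketch below runs the same grid search twice with fixed seeds and compares the outcomes. The synthetic dataset from make_classification and the helper name run_search are assumptions for illustration only; they are not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption: the asker's real data is not available)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

def run_search(seed):
    """Hypothetical helper: one seeded grid search over the question's grid."""
    clf = RandomForestClassifier(n_estimators=10, random_state=seed)
    # shuffle=True is needed for random_state to have any effect on the folds
    cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    grid_values = {'max_depth': [1, 2, 3], 'max_features': [2, 3, 4]}
    grid = GridSearchCV(clf, param_grid=grid_values, cv=cv, scoring='roc_auc')
    grid.fit(X, y)
    return grid.best_params_, grid.best_score_

first_params, first_score = run_search(1234)
second_params, second_score = run_search(1234)

# With every source of randomness seeded, both runs should agree
print(first_params == second_params, abs(first_score - second_score) < 1e-12)
```

Without the seeds, the forest and the fold assignment can change from run to run, so small AUC differences between two fits are expected even when nothing else is wrong.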

Arguments to the ROC-AUC calculation

You are passing the predicted labels instead of the true labels:

auc = roc_auc_score(y_hat, y_hat_prob)

Pass the true labels instead:

auc = roc_auc_score(y_validate, y_hat_prob)
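The AUC of exactly 1.0 follows from how the labels were produced: for a binary problem, predict returns the class with the highest predicted probability, so the probabilities always rank the predicted labels perfectly. A minimal sketch, assuming synthetic data from make_classification (not the asker's dataset) and the variable names auc_wrong and auc_right chosen here for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (assumption for illustration)
X, y = make_classification(n_samples=400, n_features=8, flip_y=0.1, random_state=0)
X_train, X_validate, y_train, y_validate = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(max_depth=3, max_features=4, n_estimators=10,
                             random_state=1234)
clf.fit(X_train, y_train)

y_hat = clf.predict(X_validate)                   # predicted labels
y_hat_prob = clf.predict_proba(X_validate)[:, 1]  # probability of class 1

# Wrong: scores the probabilities against labels derived from those same
# probabilities, so the ranking is trivially perfect
auc_wrong = roc_auc_score(y_hat, y_hat_prob)

# Right: scores the probabilities against the held-out true labels
auc_right = roc_auc_score(y_validate, y_hat_prob)

print("AUC vs predicted labels:", auc_wrong)
print("AUC vs true labels:     ", auc_right)
```

The first figure says nothing about model quality; only the second is comparable to the validation AUC obtained from the grid search.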
