Integrating RFE with GBM for feature selection and hyperparameter tuning


My name is Lucas and I'm fairly new to machine learning. I wrote this code with the help of some online documentation and tutorials, but I need help understanding whether my integration of RFE() with GBM() is correct.

import matplotlib.pyplot as plt
import scikitplot as skplt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE
from sklearn.metrics import (accuracy_score, classification_report, cohen_kappa_score,
                             confusion_matrix, f1_score, precision_score, recall_score)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def evaluateAlgorithm(X_train, X_test, y_train, y_test, dataset):
    # Without random_state, the shuffled folds differ between runs
    Kfold = StratifiedKFold(n_splits=20, shuffle=True)

    GBM = GradientBoostingClassifier(
        loss='log_loss', learning_rate=0.01,
        n_estimators=1000, subsample=0.9,
        min_samples_split=2, min_samples_leaf=1,
        min_weight_fraction_leaf=0.0, max_depth=8,
        init=None, random_state=None,
        max_features=None, verbose=0,
        max_leaf_nodes=None, warm_start=False)

    # Pipeline steps are (name, estimator) tuples; scikit-learn clones each
    # estimator during the search, so reusing the GBM instance in both steps is safe
    pipeline = Pipeline(steps=[('feature_selection', RFE(GBM)), ('model', GBM)])

    parameters = {'model__learning_rate': [0.01, 0.02, 0.03],
                  'model__subsample': [0.9, 0.5, 0.3, 0.1],
                  'model__n_estimators': [100, 500, 1000],
                  'model__max_depth': [1, 2, 3],
                  'feature_selection__n_features_to_select': [7, 14, 27]}

    grid_GBM = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=Kfold,
                            verbose=1, n_jobs=-1, refit=True, scoring='accuracy')
    grid_GBM.fit(X_train, y_train)

    print("\n=========================================================================")
    print(" Results from Grid Search Gradient Boosting")
    print("=========================================================================")
    print("\n The best estimator across ALL searched params: \n", grid_GBM.best_estimator_)
    print("\n The best score across ALL searched params: \n", grid_GBM.best_score_)
    print("\n The best parameters across ALL searched params: \n", grid_GBM.best_params_)
    print("\n=========================================================================")

    # Obtain features selected by RFE
    rfe_selected_features_indices = grid_GBM.best_estimator_['feature_selection'].support_
    rfe_selected_features_names = X_train.columns[rfe_selected_features_indices]
    print("Features selected by RFE:", rfe_selected_features_names)

    model_GBM = grid_GBM.best_estimator_

    # Cross-validation
    cv_results_GBM = cross_val_score(model_GBM, X_train, y_train, cv=Kfold, scoring='accuracy', n_jobs=-1, verbose=0)

    print()
    print("Cross Validation results Gradient Boosting: ", cv_results_GBM)
    prt_string = "CV Mean accuracy: %f (Std: %f)" % (cv_results_GBM.mean(), cv_results_GBM.std())
    print(prt_string)

    trained_Model_GBM = model_GBM.fit(X_train, y_train)

    print()
    print('========================================================')
    print()
    print(trained_Model_GBM.get_params(deep=True))
    print()
    print('=========================================================')

    # Make predictions on the test set
    pred_Labels_GBM = trained_Model_GBM.predict(X_test)
    pred_proba_GBM = trained_Model_GBM.predict_proba(X_test)

    # Evaluate performance
    print()
    print('Evaluation of the trained model Gradient Boosting: ')
    accuracy = accuracy_score(y_test, pred_Labels_GBM)
    print()
    print('Accuracy Gradient Boosting: ', accuracy)
    # pos_label='positive' assumes the class labels are the strings 'positive'/'negative'
    precision = precision_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('Precision Gradient Boosting: ', precision)
    recall = recall_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('Recall Score Gradient Boosting: ', recall)
    f1 = f1_score(y_test, pred_Labels_GBM, pos_label='positive')
    print()
    print('f1 Score Gradient Boosting: ', f1)
    confusion_mat = confusion_matrix(y_test, pred_Labels_GBM)
    classReport = classification_report(y_test, pred_Labels_GBM)
    print()
    print('Classification Report Gradient Boosting: \n', classReport)
    kappa_score = cohen_kappa_score(y_test, pred_Labels_GBM)
    print()
    print('Kappa Score Gradient Boosting: ', kappa_score)

    skplt.estimators.plot_learning_curve(model_GBM, X_train, y_train, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_roc(y_test, pred_proba_GBM, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_confusion_matrix(y_test, pred_Labels_GBM, figsize=(8, 6))
    plt.show()

    skplt.metrics.plot_precision_recall(y_test, pred_proba_GBM,
                                        title='Precision-Recall Curve', plot_micro=True,
                                        classes_to_plot=None, ax=None, figsize=(8, 6),
                                        cmap='nipy_spectral', title_fontsize='large',
                                        text_fontsize='medium')
    plt.show()


evaluateAlgorithm(X_train, X_test, y_train, y_test, dataset)

My goal is to use RFE to find the best combination of features while grid search finds the best hyperparameters for the GBM. However, RFE appears to pick the best features before the grid search over hyperparameters runs. How can I fix this so that the two processes happen together? The idea is to reach the best combination under both criteria. Also, do you have any suggestions for improving this code?

python machine-learning scikit-learn feature-selection gbm
1 Answer

As written, your code tunes the hyperparameters of the final model, but not those of the GBM inside the feature-selection step. You have a couple of options:

  1. Expand the search space to include the hyperparameters of the selection GBM, e.g.

    feature_selection__estimator__max_depth

    (see the first sketch after this list).
  2. Remove the model step.

    RFE refits a final model on the selected feature set (available as estimator_), and the methods you are likely to need are available directly on the RFE object (e.g. rfe.predict). Then just rename the hyperparameters as above (see the second sketch below).
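
A minimal sketch of option 1, reusing the pipeline and the value grids from the question; the feature_selection__estimator__* keys are the only addition, and the values swept for them here are illustrative:

    # Option 1 (sketch): tune the selection GBM alongside the final GBM.
    # The feature_selection__estimator__ prefix reaches the GBM wrapped by RFE.
    parameters = {
        # final model hyperparameters (as in the question)
        'model__learning_rate': [0.01, 0.02, 0.03],
        'model__subsample': [0.9, 0.5, 0.3, 0.1],
        'model__n_estimators': [100, 500, 1000],
        'model__max_depth': [1, 2, 3],
        # number of features RFE keeps
        'feature_selection__n_features_to_select': [7, 14, 27],
        # selection GBM hyperparameters (new; illustrative values)
        'feature_selection__estimator__max_depth': [1, 2, 3],
        'feature_selection__estimator__learning_rate': [0.01, 0.02, 0.03],
    }

Note that the grid size multiplies with every added key, which is why this route gets expensive quickly.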

The difference between these approaches is that the first allows the selection GBM to have different hyperparameters from the model GBM. That tends to be more expensive but more flexible. I would personally be surprised if it gave a significant improvement, so I suggest the second approach unless you have the time and inclination to experiment.
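
A minimal sketch of option 2, assuming the same imports, GBM, Kfold, and data as in the question; RFE itself becomes the estimator passed to GridSearchCV, so its internal GBM doubles as the final model and is reached through the estimator__ prefix:

    # Option 2 (sketch): RFE alone; its internal GBM, refit on the selected
    # features, serves as the final model.
    rfe = RFE(GBM)

    parameters = {
        'n_features_to_select': [7, 14, 27],
        'estimator__learning_rate': [0.01, 0.02, 0.03],
        'estimator__subsample': [0.9, 0.5, 0.3, 0.1],
        'estimator__n_estimators': [100, 500, 1000],
        'estimator__max_depth': [1, 2, 3],
    }

    grid = GridSearchCV(rfe, parameters, cv=Kfold, scoring='accuracy',
                        n_jobs=-1, refit=True, verbose=1)
    grid.fit(X_train, y_train)

    best_rfe = grid.best_estimator_
    print("Features selected by RFE:", X_train.columns[best_rfe.support_])
    pred_labels = best_rfe.predict(X_test)       # delegates to estimator_
    pred_proba = best_rfe.predict_proba(X_test)  # available because GBM supports it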
