Sklearn: passing fit() parameters to xgboost in a pipeline

Question (votes: 0, answers: 5)

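Similar to "How to pass a parameter to only one part of a pipeline object in scikit learn?", I want to pass parameters to only one part of a pipeline. Usually, this should work fine, for example: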

estimator = XGBClassifier()
pipeline = Pipeline([
        ('clf', estimator)
    ])

and executing it like this:

pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20)

But it fails with:

    /usr/local/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params)
        114         """
        115         Xt, yt, fit_params = self._pre_transform(X, y, **fit_params)
    --> 116         self.steps[-1][-1].fit(Xt, yt, **fit_params)
        117         return self
        118 

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/sklearn.py in fit(self, X, y, sample_weight, eval_set, eval_metric, early_stopping_rounds, verbose)
        443                               early_stopping_rounds=early_stopping_rounds,
        444                               evals_result=evals_result, obj=obj, feval=feval,
    --> 445                               verbose_eval=verbose)
        446 
        447         self.objective = xgb_options["objective"]

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in train(params, dtrain, num_boost_round, evals, obj, feval, maximize, early_stopping_rounds, evals_result, verbose_eval, learning_rates, xgb_model, callbacks)
        201                            evals=evals,
        202                            obj=obj, feval=feval,
    --> 203                            xgb_model=xgb_model, callbacks=callbacks)
        204 
        205 

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/training.py in _train_internal(params, dtrain, num_boost_round, evals, obj, feval, xgb_model, callbacks)
         97                                end_iteration=num_boost_round,
         98                                rank=rank,
    ---> 99                                evaluation_result_list=evaluation_result_list))
        100         except EarlyStopException:
        101             break

    /usr/local/lib/python3.5/site-packages/xgboost-0.6-py3.5.egg/xgboost/callback.py in callback(env)
        196     def callback(env):
        197         """internal function"""
    --> 198         score = env.evaluation_result_list[-1][1]
        199         if len(state) == 0:
        200             init(env)

    IndexError: list index out of range

whereas calling the estimator directly:

estimator.fit(X_train, y_train, early_stopping_rounds=20)

works fine.

python scikit-learn pipeline xgboost keyword-argument
5 Answers
17
votes

For early stopping rounds, you must always specify a validation set through the eval_set argument. Here is how to fix the error in your code:

pipeline.fit(X_train, y_train, clf__early_stopping_rounds=20, clf__eval_set=[(test_X, test_y)])
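
To make the routing visible, here is a minimal sketch of how Pipeline.fit() splits each step__param keyword and hands it to the named step. RecordingClassifier is a hypothetical stand-in for XGBClassifier (it is not part of XGBoost or scikit-learn) and the data is random, so the example stays independent of the installed XGBoost version. Depending on that version, early_stopping_rounds may need to be given to the XGBClassifier constructor instead of fit().

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.pipeline import Pipeline


class RecordingClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for XGBClassifier: it only records its fit kwargs."""

    def fit(self, X, y, early_stopping_rounds=None, eval_set=None):
        self.classes_ = np.unique(y)
        self.received_ = {
            "early_stopping_rounds": early_stopping_rounds,
            "eval_set": eval_set,
        }
        return self

    def predict(self, X):
        # Trivial prediction; the model itself is not the point of the sketch.
        return np.full(len(X), self.classes_[0])


rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20, 3)), np.arange(20) % 2
X_val, y_val = rng.normal(size=(5, 3)), np.arange(5) % 2

pipeline = Pipeline([("clf", RecordingClassifier())])
pipeline.fit(
    X_train, y_train,
    clf__early_stopping_rounds=20,   # "clf__" prefix selects the step named "clf"
    clf__eval_set=[(X_val, y_val)],
)

received = pipeline.named_steps["clf"].received_
print(received["early_stopping_rounds"])  # 20
print(len(received["eval_set"]))          # 1
```

The prefix is stripped before the call, so the final step sees plain early_stopping_rounds and eval_set keywords.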

14
votes

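I recently used the following steps to pass the eval_metric and eval_set parameters to Xgboost.

1. Create a pipeline with the preprocessing/feature transformation steps:

It is built from a previously defined pipeline whose last step is the xgboost model.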

pipeline_temp = pipeline.Pipeline(pipeline.cost_pipe.steps[:-1])  

2. Fit this pipeline

X_trans = pipeline_temp.fit_transform(X_train[FEATURES],y_train)

3. Create your eval_set by applying the transformations to the test set

eval_set = [(X_trans, y_train), (pipeline_temp.transform(X_test), y_test)]

4. Add your xgboost step back into the pipeline

 pipeline_temp.steps.append(pipeline.cost_pipe.steps[-1])

5. Fit the new pipeline, passing the parameters

pipeline_temp.fit(X_train[FEATURES], y_train,
             xgboost_model__eval_metric = ERROR_METRIC,
             xgboost_model__eval_set = eval_set)

6. Persist the pipeline, if you wish.

joblib.dump(pipeline_temp, save_path)
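
The steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: RecordingRegressor is a hypothetical stand-in for the xgboost step, StandardScaler stands in for the real preprocessing, the data is random, and a separate full pipeline is used instead of appending the model step back onto pipeline_temp.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for the xgboost step; it keeps the eval_set it was given."""

    def fit(self, X, y, eval_set=None):
        self.eval_set_ = eval_set
        return self

    def predict(self, X):
        return np.zeros(len(X))


rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=5.0, size=(50, 2))
X_test = rng.normal(loc=100.0, scale=5.0, size=(10, 2))
y_train, y_test = rng.normal(size=50), rng.normal(size=10)

full_pipe = Pipeline([("scaler", StandardScaler()),
                      ("xgboost_model", RecordingRegressor())])

# Steps 1-3: fit only the preprocessing steps, then transform both sets.
preprocess = Pipeline(full_pipe.steps[:-1])
X_trans = preprocess.fit_transform(X_train, y_train)
eval_set = [(X_trans, y_train), (preprocess.transform(X_test), y_test)]

# Step 5: fit the full pipeline, routing eval_set to the last step.
full_pipe.fit(X_train, y_train, xgboost_model__eval_set=eval_set)

seen = full_pipe.named_steps["xgboost_model"].eval_set_
print(len(seen))                    # 2
print(abs(seen[1][0].mean()) < 3)   # test features arrive in the scaled space
```

The preprocessing steps are fitted twice here (once alone, once inside the full pipeline), which is the price of keeping a single fitted pipeline object at the end.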

8
votes

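Here is the solution: https://www.kaggle.com/c/otto-group-product-classification-challenge/forums/t/13755/xgboost-early-stopping-and-other-issues early_stopping_rounds requires a watchlist/eval_set to be passed as well. Unfortunately, this does not work for me, because the variables on the watchlist need a preprocessing step that is only applied inside the pipeline, so I would have to apply that step manually.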
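
A small sketch of the problem described here: fit parameters are forwarded to the final step as they are, so an eval_set passed through the pipeline never goes through the earlier transformers. RecordingRegressor is a hypothetical stand-in for an XGBoost model, and StandardScaler with random data is an assumed example of a preprocessing step.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class RecordingRegressor(RegressorMixin, BaseEstimator):
    """Hypothetical stand-in for an XGBoost model; it stores what fit() receives."""

    def fit(self, X, y, eval_set=None):
        self.train_mean_ = float(np.mean(X))
        self.eval_mean_ = float(np.mean(eval_set[0][0]))
        return self

    def predict(self, X):
        return np.zeros(len(X))


rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=5.0, size=(50, 2))
X_val = rng.normal(loc=100.0, scale=5.0, size=(10, 2))
y_train, y_val = rng.normal(size=50), rng.normal(size=10)

pipe = Pipeline([("scaler", StandardScaler()), ("model", RecordingRegressor())])
pipe.fit(X_train, y_train, model__eval_set=[(X_val, y_val)])

model = pipe.named_steps["model"]
print(abs(model.train_mean_) < 1e-6)  # training data arrived scaled (mean near 0)
print(model.eval_mean_ > 50)          # eval_set arrived raw (mean near 100)
```

With a real model, early stopping would then be judged on validation features that live in a different space from the training features.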


0
votes

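Here is a solution that works in a pipeline with GridSearchCV:

Override the fit() function of XGBRegressor or XGBClassifier

  • This step uses train_test_split() to select the specified number of validation records from X for the eval_set, then passes the remaining records along to fit().
  • A new parameter, eval_test_size, is added to .fit() to control the number of validation records. (See the train_test_split test_size documentation.)
  • **kwargs passes along any other arguments the user adds for the XGBRegressor.fit() function.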
from xgboost.sklearn import XGBRegressor
from sklearn.model_selection import train_test_split

class XGBRegressor_ES(XGBRegressor):

    def fit(self, X, y, *, eval_test_size=None, **kwargs):

        # Default: fit on all records when no validation split is requested
        # (otherwise X_train/y_train would be undefined below).
        X_train, y_train = X, y

        if eval_test_size is not None:

            # Reuse the model's own random_state so the split is reproducible
            params = self.get_xgb_params()

            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=eval_test_size, random_state=params['random_state'])

            eval_set = [(X_test, y_test)]

            # Could add (X_train, y_train) to eval_set
            # to get .evals_result() for both train and test
            #eval_set = [(X_train, y_train),(X_test, y_test)]

            kwargs['eval_set'] = eval_set

        return super().fit(X_train, y_train, **kwargs)
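
As a quick illustration of what eval_test_size controls, train_test_split() treats an integer test_size as an absolute number of held-out records and a float as a fraction of them. The array sizes and values below are made up for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(2000).reshape(1000, 2)
y = np.arange(1000)

# An integer test_size is an absolute number of validation records.
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=200, random_state=7)
print(X_fit.shape, X_val.shape)   # (800, 2) (200, 2)

# A float test_size is a fraction of the records.
X_fit2, X_val2, y_fit2, y_val2 = train_test_split(X, y, test_size=0.25, random_state=7)
print(X_fit2.shape, X_val2.shape)  # (750, 2) (250, 2)
```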

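Example usage

Below is a multi-step pipeline that applies several transformations to X. The pipeline's fit() function passes the new evaluation parameter to the XGBRegressor_ES class above, in the form xgbr__eval_test_size=200. In this example:

  • X_train contains the text documents passed to the pipeline.
  • XGBRegressor_ES.fit() uses train_test_split() to select 200 records from X_train as the validation set for early stopping. (This could also be a fraction, e.g. xgbr__eval_test_size=0.2)
  • The remaining records in X_train are passed on to XGBRegressor.fit() for the actual fit().
  • Early stopping can now occur after 75 rounds of boosting without improvement, for each cv fold in the grid search.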
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectPercentile, f_regression
   
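xgbr_pipe = Pipeline(steps=[('tfidf', TfidfVectorizer()),
                     ('vt', VarianceThreshold()),
                     # TF-IDF output is sparse, so centering must be disabled
                     ('scaler', StandardScaler(with_mean=False)),
                     # f_regression suits a regression target; the default
                     # score function, f_classif, is meant for classification
                     ('Sp', SelectPercentile(f_regression)),
                     ('xgbr', XGBRegressor_ES(n_estimators=2000,
                                             objective='reg:squarederror',
                                             eval_metric='mae',
                                             learning_rate=0.0001,
                                             random_state=7))])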

X_train = train_idxs['f_text'].values
y_train = train_idxs['Pct_Change_20'].values

Pipeline fit example:

%time xgbr_pipe.fit(X_train, y_train, 
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae', 
                    xgbr__early_stopping_rounds=75)

GridSearchCV fit example:

from sklearn.model_selection import GridSearchCV

learning_rate = [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3]
param_grid = dict(xgbr__learning_rate=learning_rate)

grid_search = GridSearchCV(xgbr_pipe, param_grid, scoring="neg_mean_absolute_error", n_jobs=-1, cv=10)
# The fit parameters are forwarded to xgbr_pipe.fit() for every cv fold
grid_result = grid_search.fit(X_train, y_train,
                    xgbr__eval_test_size=200,
                    xgbr__eval_metric='mae',
                    xgbr__early_stopping_rounds=75)

0
votes

Suppose you have a pipeline like this:

pipeline = Pipeline([('preprocessor', preprocessor), ('model', xgboost_model)])

You can use this function to pass eval_set to the model.fit step:

def pipeline_fit_with_eval_set(pipeline, X_train, y_train, X_test, y_test, fit_params=None):
    """
    Fit a scikit-learn pipeline with eval_set support.

    Parameters:
    - pipeline: The scikit-learn pipeline; its last step is the model.
    - X_train: Training data.
    - y_train: Training labels.
    - X_test: Test data.
    - y_test: Test labels.
    - fit_params: Additional fit parameters for the model step.

    Usage:
    pipeline_fit_with_eval_set(my_pipeline, X_train, y_train, X_test, y_test, fit_params={'eval_metric': 'logloss'})
    """
    # Copy so that neither the caller's dict nor a shared default is mutated
    fit_params = dict(fit_params or {})

    # Step 1: Extract Preprocessors (same objects as in `pipeline`, not copies)
    pipeline_preprocessors = Pipeline(pipeline.steps[:-1])

    # Step 2: Fit preprocessors and Transform Training Data
    # Make sure not to use any test data for the fit step
    # (y_train is passed along for supervised transformers)
    X_train_transformed = pipeline_preprocessors.fit_transform(X_train, y_train)

    # Step 3: Transform Test Data
    X_test_transformed = pipeline_preprocessors.transform(X_test)

    # Step 4: Prepare Eval Set
    fit_params["eval_set"] = [(X_test_transformed, y_test)]

    # Step 5: Extract Model and Fit
    model = pipeline.steps[-1][1]
    model.fit(X_train_transformed, y_train, **fit_params)

    # All steps were fitted in place, so the pipeline is ready for predict()
    return pipeline

I was originally inspired by this answer and improved on it.
You can use this function with any other model that requires eval_set, such as LightGBM or CatBoost.
