How do I use a validation set with a MultiOutputRegressor wrapping XGBRegressor?

Question  Votes: 0  Answers: 2

I am using the following MultiOutputRegressor:

from xgboost import XGBRegressor
from sklearn.multioutput import MultiOutputRegressor

#Define the estimator
estimator = XGBRegressor(
    objective = 'reg:squarederror'
    )

# Define the model
my_model = MultiOutputRegressor(estimator = estimator, n_jobs = -1).fit(X_train, y_train)

I want to use a validation set to evaluate the performance of my XGBRegressor, but I believe MultiOutputRegressor does not support passing eval_set to the fit function.

How can I use a validation set in this case? Is there any workaround for adapting XGBRegressor to produce multiple outputs?

python validation machine-learning regression xgboost
2 Answers
3 votes

You can try overriding the fit method of MultiOutputRegressor like this:

from sklearn.utils.validation import _check_fit_params
from sklearn.base import is_classifier
from sklearn.utils.fixes import delayed
from joblib import Parallel
from sklearn.multioutput import _fit_estimator

class MyMultiOutputRegressor(MultiOutputRegressor):
    
    def fit(self, X, y, sample_weight=None, **fit_params):
        """ Fit the model to data.
        Fit a separate model for each output variable.
        Parameters
        ----------
        X : {array-like, sparse matrix} of shape (n_samples, n_features)
            Data.
        y : {array-like, sparse matrix} of shape (n_samples, n_outputs)
            Multi-output targets. An indicator matrix turns on multilabel
            estimation.
        sample_weight : array-like of shape (n_samples,), default=None
            Sample weights. If None, then samples are equally weighted.
            Only supported if the underlying regressor supports sample
            weights.
        **fit_params : dict of string -> object
            Parameters passed to the ``estimator.fit`` method of each step.
            .. versionadded:: 0.23
        Returns
        -------
        self : object
        """

        if not hasattr(self.estimator, "fit"):
            raise ValueError("The base estimator should implement"
                             " a fit method")

        X, y = self._validate_data(X, y,
                                   force_all_finite=False,
                                   multi_output=True, accept_sparse=True)

        if is_classifier(self):
            check_classification_targets(y)

        if y.ndim == 1:
            raise ValueError("y must have at least two dimensions for "
                             "multi-output regression but has only one.")

        if (sample_weight is not None and
                not has_fit_parameter(self.estimator, 'sample_weight')):
            raise ValueError("Underlying estimator does not support"
                             " sample weights.")

        fit_params_validated = _check_fit_params(X, fit_params)
        [(X_test, Y_test)] = fit_params_validated.pop('eval_set')
        self.estimators_ = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_estimator)(
                self.estimator, X, y[:, i], sample_weight,
                **fit_params_validated, eval_set=[(X_test, Y_test[:, i])])
            for i in range(y.shape[1]))
        return self

Then pass eval_set to the fit method:

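# estimator is the XGBRegressor defined in the question
model = MyMultiOutputRegressor(estimator)
fit_params = dict(
        eval_set=[(X_test, Y_test)],
        # recent XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor instead
        early_stopping_rounds=10
        )
model.fit(X_train, Y_train, **fit_params)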
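If subclassing feels too heavy, the same idea can be written as a plain loop: fit one estimator per target column and give each one the matching column of the validation targets. The following is only a minimal sketch under stated assumptions. The names fit_per_output, predict_per_output and make_estimator are hypothetical helpers, not part of scikit-learn or XGBoost, and the loop runs sequentially rather than in parallel.

```python
import numpy as np


def fit_per_output(make_estimator, X, Y, X_val, Y_val, **fit_kwargs):
    """Fit one estimator per target column, each with its own eval_set.

    make_estimator is a hypothetical zero-argument factory that returns a
    fresh, unfitted estimator whose fit() accepts an eval_set keyword.
    """
    Y = np.asarray(Y)
    Y_val = np.asarray(Y_val)
    if Y.ndim != 2 or Y_val.ndim != 2:
        raise ValueError("Y and Y_val must be 2-D (n_samples, n_outputs)")
    if Y.shape[1] != Y_val.shape[1]:
        raise ValueError("Y and Y_val must have the same number of outputs")

    estimators = []
    for i in range(Y.shape[1]):
        est = make_estimator()
        # Each estimator only sees column i of the training and validation targets.
        est.fit(X, Y[:, i], eval_set=[(X_val, Y_val[:, i])], **fit_kwargs)
        estimators.append(est)
    return estimators


def predict_per_output(estimators, X):
    """Stack the per-column predictions into an (n_samples, n_outputs) array."""
    return np.column_stack([est.predict(X) for est in estimators])
```

With XGBoost this would presumably be called along the lines of fit_per_output(lambda: XGBRegressor(objective='reg:squarederror'), X_train, Y_train, X_test, Y_test), with any early-stopping settings supplied wherever your XGBoost version expects them.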

0 votes

With a few small edits, @itamar-kanter's solution worked for me. This is a bit long for a comment, so I am posting it as an answer instead.

Note that @itamar-kanter's solution may have been inspired by the fit() function of darts.utils.multioutput.MultiOutputRegressor: unit8co.github.io/darts/_modules/darts/utils/multioutput.html

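  1. The line [(X_test, Y_test)] = fit_params_validated.pop('eval_set') looks like a typo to me. If eval_set is passed as a single (X_test, Y_test) pair rather than as a list containing that pair, it should read: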

     [X_test, Y_test] = fit_params_validated.pop('eval_set')
    

    That is, use [X_test, Y_test] or (X_test, Y_test) to unpack the validation features and targets.

    Alternatively, you can use the same syntax as fit() in darts.utils.multioutput.MultiOutputRegressor:

     eval_set = fit_params_validated.pop("eval_set")
    
     self.estimators_ = Parallel(n_jobs=self.n_jobs)(
             delayed(_fit_estimator)(
                 self.estimator,
                 X,
                 y[:, i],
                 sample_weight,
                 # eval set may be a list (for XGBRegressor), in which case we have to keep it as a list
                 eval_set=[(eval_set[0][0], eval_set[0][1][:, i])]
                 if isinstance(eval_set, list)
                 else (eval_set[0], eval_set[1][:, i]),
                 **fit_params_validated
             )
             for i in range(y.shape[1])
     )
    
  2. Parallel and delayed should be imported the same way as in darts.utils.multioutput.MultiOutputRegressor:

     try:
          # delayed was moved from sklearn.utils.fixes to sklearn.utils.parallel in v1.3
          from sklearn.utils.parallel import Parallel, delayed
     except ImportError:
          from joblib import Parallel
          from sklearn.utils.fixes import delayed
    

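    Importing Parallel and delayed with try ... except ... avoids an unnecessary UserWarning from parallel.py: "sklearn.utils.parallel.delayed should be used with sklearn.utils.parallel.Parallel to make it possible to propagate the scikit-learn configuration of the current thread to the joblib workers."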

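    Also, using sklearn's own "from sklearn.utils.parallel import Parallel, delayed" instead of "from joblib import Parallel, delayed" may make ML training faster. My guess is that sklearn's internal implementation has some features that work better with parallel sklearn models?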

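  3. Finally, the imports in @itamar-kanter's code seem to be missing "from sklearn.utils.multiclass import check_classification_targets". The code also calls has_fit_parameter, which lives in sklearn.utils.validation.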
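Regarding point 1, the difference between the two unpacking forms comes down to whether eval_set arrives as a list of (X, Y) pairs or as one bare pair. A small helper can accept both shapes and return the slice for a single output column. This is only a sketch: eval_set_for_output is a hypothetical name, not a darts or scikit-learn function, and unlike the darts snippet above it keeps every pair in the list rather than only the first one.

```python
import numpy as np


def eval_set_for_output(eval_set, i):
    """Restrict eval_set to target column i.

    Accepts either a list of (X_val, Y_val) pairs (the form XGBRegressor
    expects) or a single (X_val, Y_val) tuple, and returns the same shape
    with Y_val reduced to its i-th column.
    """
    if isinstance(eval_set, list):
        # Keep the list wrapper so the result can be passed straight to fit().
        return [(X_val, np.asarray(Y_val)[:, i]) for X_val, Y_val in eval_set]
    X_val, Y_val = eval_set
    return (X_val, np.asarray(Y_val)[:, i])
```

Note that a single pair has to be passed as a tuple here. A two-element list such as [X_test, Y_test] would be interpreted as a list of pairs and fail to unpack.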
