Is there an example of xgb.XGBRegressor where callbacks=[early_stop], with early_stop=xgb.callback.EarlyStopping, is used together with cross_val_predict?


The XGBClassifier documentation shows an EarlyStopping callback:

```
import xgboost
from sklearn.datasets import load_digits

es = xgboost.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="mlogloss",
    )
clf = xgboost.XGBClassifier(tree_method="hist", device="cuda", callbacks=[es])

X, y = load_digits(return_X_y=True)
clf.fit(X, y, eval_set=[(X, y)])
```

But how does "validation_0" refer to the eval_set passed to clf.fit, so that EarlyStopping has a metric to evaluate?
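As far as I can tell, xgboost's scikit-learn wrappers name the entries of eval_set by position: the first tuple is logged as validation_0, the second as validation_1, and so on. A minimal sketch (variable names are my own):

```
import xgboost
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

clf = xgboost.XGBClassifier(n_estimators=10, eval_metric="mlogloss")
# each tuple in eval_set gets a positional name: "validation_0", "validation_1", ...
# and that name is what EarlyStopping's data_name has to match
clf.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)])
print(clf.evals_result().keys())  # dict_keys(['validation_0', 'validation_1'])
```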

I tried to apply the same thing to XGBRegressor:

```
import xgboost as xgb
from sklearn.model_selection import cross_val_predict, KFold
import pandas as pd
import numpy as np

class CustomEarlyStopping(xgb.callback.EarlyStopping):
    def __init__(self, rounds=2, min_delta=1e-3, save_best=True, maximize=False, data_name="validation_0", metric_name="rmse"):
        super().__init__(rounds=rounds, min_delta=min_delta, save_best=save_best, maximize=maximize, data_name=data_name, metric_name=metric_name)

# TRAIN MODEL (10x10-fold CV)
cvx = KFold(n_splits=10, shuffle=True, random_state=239)
es = CustomEarlyStopping()

model = xgb.XGBRegressor(colsample_bytree=0.3, learning_rate=0.1, max_depth=10, alpha=10, n_estimators=500, n_jobs=-1,
                         random_state=239, callbacks=[es])
model.set_params(tree_method='approx', device="cpu")

cv_preds = []
for i in range(0, 10):
    cv_preds.append(cross_val_predict(model, np.asarray(X_train), np.asarray(y_train), cv=cvx, method='predict', n_jobs=1, verbose=2))
```

I put data_name="validation_0" in the EarlyStopping __init__ without naming a test set in each CV fold. What is wrong with the way this code behaves? Thanks.

The XGBRegressor code returns this error:

ValueError: Must have at least 1 validation dataset for early stopping.

What should happen is that cv_preds gets filled with 10 ndarrays of predicted y.
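For example, once the loop finishes I would combine the 10 repeats roughly like this (just a sketch; it assumes each element of cv_preds is an array aligned with y_train):

```
import numpy as np

# each element of cv_preds is one out-of-fold prediction of y_train,
# so the 10 repeats can be stacked and averaged per sample
cv_preds_mean = np.mean(np.vstack(cv_preds), axis=0)
```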

python machine-learning scikit-learn jupyter-notebook xgboost
1 Answer

From the scikit-learn documentation (link):

> The data is split according to the cv parameter. Each sample belongs to exactly one test set, and its prediction is computed with an estimator fitted on the corresponding training set.

Those test sets are not passed to the estimator's .fit() method, so you cannot use them for early stopping with xgboost's scikit-learn estimators.
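That is also where the ValueError above comes from: with an EarlyStopping callback attached, XGBoost requires at least one evaluation set, and the .fit() calls that cross_val_predict() makes internally never supply one. A minimal sketch of my own that reproduces the error:

```
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=100, n_features=5, random_state=0)

es = xgb.callback.EarlyStopping(rounds=2, save_best=True)
model = xgb.XGBRegressor(n_estimators=10, callbacks=[es])

# fitting without eval_set is effectively what cross_val_predict() does, and it raises:
# ValueError: Must have at least 1 validation dataset for early stopping.
model.fit(X, y)
```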

Before running cross_val_predict(), hold out part of the training data as a validation set, then use that validation set to trigger early stopping in all k training runs that cross_val_predict() performs.

Consider this example, which uses Python 3.11, numpy==1.26.4, scikit-learn==1.4.1, and xgboost==2.0.3.

```
import numpy as np
import xgboost as xgb
import sklearn
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, KFold, train_test_split

# generate synthetic regression training data
X, y = make_regression(n_samples=10_000, n_features=7)

# reserve 10% of the data as a validation set
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.9, random_state=708
)

# initialize 3-fold cross-validation splitter
cvx = KFold(n_splits=3, shuffle=True, random_state=239)

# define a custom metric that never improves
def custom_constant_metric(y_true, y_pred):
    return 0.123

# choose specific values for xgboost early stopping
early_stop = xgb.callback.EarlyStopping(
    rounds=2,
    min_delta=1e-3,
    save_best=True,
    maximize=False,
    data_name="validation_0",
    metric_name="custom_constant_metric",
)

# enable metadata_routing
sklearn.set_config(enable_metadata_routing=True)

# configure an XGBoost regressor, and tell scikit-learn's
# cross-validation machinery to forward parameter "eval_set"
# through to XGBRegressor.fit()
model = xgb.XGBRegressor(
    n_estimators=5,
    random_state=239,
    tree_method="approx",
    callbacks=[early_stop],
    eval_metric=custom_constant_metric,
).set_fit_request(eval_set=True)

# generate predictions from CV splits
cv_preds = cross_val_predict(
    estimator=model,
    X=np.asarray(X_train),
    y=np.asarray(y_train),
    cv=cvx,
    method="predict",
    params={"eval_set": [(X_valid, y_valid)]},
    verbose=1,
)

# evaluate fit
r2_score(y_train, cv_preds)
# 0.406
```
Note that in this example I used 3-fold cross-validation, asked XGBoost to perform 5 rounds of boosting, and asked XGBoost to trigger early stopping after just 2 rounds without improvement.

The logs from cross_val_predict() confirm that this is working as expected... they show 3 training runs (one per fold produced by KFold), each stopping after only 2 boosting rounds.

```
[0] validation_0-rmse:106.90484 validation_0-custom_constant_metric:0.12300
[1] validation_0-rmse:84.89786  validation_0-custom_constant_metric:0.12300
[0] validation_0-rmse:107.70702 validation_0-custom_constant_metric:0.12300
[1] validation_0-rmse:85.48965  validation_0-custom_constant_metric:0.12300
[0] validation_0-rmse:108.08369 validation_0-custom_constant_metric:0.12300
[1] validation_0-rmse:86.36046  validation_0-custom_constant_metric:0.12300
```
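If you prefer to check this programmatically instead of reading logs, one option (my own sketch, reusing model, cvx, and the train/validation arrays defined above) is to run cross_validate() with return_estimator=True and inspect each fitted model's best_iteration:

```
from sklearn.model_selection import cross_validate

cv_results = cross_validate(
    estimator=model,
    X=np.asarray(X_train),
    y=np.asarray(y_train),
    cv=cvx,
    params={"eval_set": [(X_valid, y_valid)]},
    return_estimator=True,
)
# with the constant metric, every run stops early, so best_iteration stays small
for fitted in cv_results["estimator"]:
    print(fitted.best_iteration)
```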
The custom metric function is used here just to show how this works... replace it with "rmse" or whatever evaluation metric you want in your real application.
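For instance, a version of the callback tied to the built-in RMSE metric could look like this (my sketch; XGBRegressor already reports rmse on the eval set by default, so no custom eval_metric is needed):

```
early_stop_rmse = xgb.callback.EarlyStopping(
    rounds=10,
    min_delta=1e-3,
    save_best=True,
    maximize=False,           # lower rmse is better
    data_name="validation_0",
    metric_name="rmse",       # matches the "validation_0-rmse" column in the log
)
```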

The need for sklearn.set_config() and .set_fit_request() comes from changes introduced in scikit-learn 1.4. For details, see "Metadata Routing" in the scikit-learn documentation (link).
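If you are on a scikit-learn release from before metadata routing, or prefer not to enable it, my understanding is that the older fit_params argument of cross_val_predict() can forward eval_set in much the same way (sketch only; here model would be built as above but without the .set_fit_request() call):

```
# legacy-style alternative, deprecated in scikit-learn 1.4 in favour of `params`
cv_preds = cross_val_predict(
    estimator=model,
    X=np.asarray(X_train),
    y=np.asarray(y_train),
    cv=cvx,
    method="predict",
    fit_params={"eval_set": [(X_valid, y_valid)]},
)
```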
