I am using RFECV in scikit-learn for feature selection. I want to run my XGBoost model on log(y), because I have already been able to show that it performs better than on raw y.

Simple model without the transform: no problem, RFECV works and I get the number of selected features.

Log-transformed model = problem: I get this error:

"ValueError: Input X contains NaN. RFECV does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values"

What I don't understand is why the simple model has no NaN problem while the log-transformed model does. There are no NaNs in my target y.

How can I fix this and run RFECV with a log-transformed target?
import numpy as np
import xgboost as xgb
from scipy.stats import randint, uniform, loguniform
from sklearn.feature_selection import RFECV
from sklearn.compose import TransformedTargetRegressor

# Base estimator
rs = 45
xgboost_reg = xgb.XGBRegressor(
    random_state=rs,
    grow_policy="depthwise",
    booster="gbtree",    # gblinear or dart; gbtree and dart use tree-based models while gblinear uses linear functions
    tree_method="auto",  # pick best option between hist, exact and approx
    n_estimators=randint(300, 500).rvs(random_state=rs),
    subsample=uniform(0.5, 0.5).rvs(random_state=rs),
    max_depth=randint(3, 10).rvs(random_state=rs),
    learning_rate=loguniform(0.05, 0.2).rvs(random_state=rs),
    colsample_bytree=uniform(0.5, 0.5).rvs(random_state=rs),
    min_child_weight=randint(1, 20).rvs(random_state=rs),
    gamma=uniform(0.5, 1).rvs(random_state=rs),
    reg_alpha=uniform(0.0, 1.0).rvs(random_state=rs),
    reg_lambda=uniform(0.0, 1.0).rvs(random_state=rs),
    max_delta_step=randint(1, 10).rvs(random_state=rs),
)
# RFECV settings
n_features = 89
step = 20
n_scores = 2
min_features_to_select = 9
# Simple model = working
rfecv = RFECV(
    xgboost_reg,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select=min_features_to_select,
    n_jobs=-1,
)
rfecv.fit(x, y)
print(rfecv.n_features_)
# Log-transformed model = error
log_estimator = TransformedTargetRegressor(
    regressor=xgboost_reg,
    func=np.log,
    inverse_func=np.exp,
)
rfecv_log = RFECV(
    estimator=log_estimator,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select=min_features_to_select,
    n_jobs=-1,
)
rfecv_log.fit(x, y)
print(rfecv_log.n_features_)
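For context, here is a quick sanity check I would run on the data before fitting, just to narrow down where the NaNs could come from. This is only a sketch: `x_demo` and `y_demo` are placeholder arrays standing in for my real `x` and `y`. It checks both the features (what the error complains about) and the transformed target, since `np.log` silently produces `-inf` for zeros and `NaN` for negative values.

```python
import numpy as np

# Placeholder data standing in for the real x and y (x_demo deliberately
# contains one NaN to show what the check reports).
x_demo = np.array([[1.0, np.nan], [2.0, 3.0]])
y_demo = np.array([10.0, 20.0])

# The error message points at X, so count NaNs in the features first.
print("NaNs in X:", np.isnan(x_demo).sum())

# np.log yields -inf for 0 and NaN for negatives; confirm log(y) is finite.
print("non-finite values in log(y):", (~np.isfinite(np.log(y_demo))).sum())
```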