Target transformation and feature selection. ValueError: Input X contains NaN

Question (0 votes, 1 answer)

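I am using RFECV in scikit-learn for feature selection. I want to fit an XGBoost model on log(y), because I have been able to show that it performs better than fitting on y directly.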

Simple model without transformation: no problem. RFECV works and I get the number of selected features.

Log-transformed model: problem. I get an error saying:

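"ValueError: Input X contains NaN. RFECV does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values"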

What I don't understand is why the simple model has no NaN problem while the log-transformed model does. There are no NaN values in my target y.

How can I solve this and run RFECV with a log-transformed target?

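# Imports needed by the snippet below
import numpy as np
import xgboost as xgb
from scipy.stats import loguniform, randint, uniform
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_selection import RFECV

# Base estimator
rs = 45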
xgboost_reg = xgb.XGBRegressor(random_state = rs, 
                                grow_policy = "depthwise", 
                                booster = "gbtree", # gblinear or dart; gbtree and dart use tree based models while gblinear uses linear functions.
                                tree_method = "auto", # pick best option between hist, exact and approx
                                n_estimators = randint(300,500).rvs(random_state = rs),
                                subsample = uniform(0.5, 0.5).rvs(random_state = rs),
                                max_depth = randint(3,10).rvs(random_state = rs),
                                learning_rate = loguniform(0.05, 0.2).rvs(random_state = rs),
                                colsample_bytree = uniform(0.5, 0.5).rvs(random_state = rs),
                                min_child_weight =  randint(1,20).rvs(random_state = rs),
                                gamma = uniform(0.5, 1).rvs(random_state = rs),
                                reg_alpha = uniform(0.0, 1.0).rvs(random_state = rs),
                                reg_lambda = uniform(0.0, 1.0).rvs(random_state = rs),
                                max_delta_step = randint(1,10).rvs(random_state = rs)
)

# RFECV settings
n_features = 89
step = 20
n_scores = 2
min_features_to_select = 9

# Simple model = working
rfecv = RFECV(
    xgboost_reg,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select= min_features_to_select,
    n_jobs=-1, 
)
rfecv.fit(x, y)
print(rfecv.n_features_)

# Log-transformed model = error
log_estimator = TransformedTargetRegressor(regressor=xgboost_reg,
                                             func=np.log,
                                             inverse_func=np.exp)
rfecv_log = RFECV(
    estimator= log_estimator,
    step=step,
    cv=4,
    scoring="neg_root_mean_squared_error",
    min_features_to_select= min_features_to_select,
    n_jobs=-1, 
)
rfecv_log.fit(x, y)
print(rfecv_log.n_features_)

Tags: python, scikit-learn, cross-validation, feature-selection, rfe

1 Answer (0 votes)

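Check whether any target values in your data are negative.

np.log() produces NaN for negative numbers (and -inf for zero). See this related question about divide-by-zero errors in a log-transformed target regressor.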
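
The check above can be sketched as follows. Because the error message names X rather than y, the sketch also counts NaN values in the feature matrix. XGBoost handles missing values natively, which is one possible reason the unwrapped model runs without complaint, but whether that applies here is an assumption. The helper name `summarize_log_inputs` and the toy arrays are hypothetical stand-ins for the `x` and `y` in the question.

```python
import numpy as np

def summarize_log_inputs(X, y):
    """Count values that could lead to NaN or inf when fitting on log(y)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    return {
        "y_negative": int((y < 0).sum()),   # np.log returns nan for these
        "y_zero": int((y == 0).sum()),      # np.log returns -inf for these
        "y_nan": int(np.isnan(y).sum()),
        "X_nan": int(np.isnan(X).sum()),    # the error message refers to X
    }

# Toy data (hypothetical), standing in for x and y from the question
X_demo = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y_demo = np.array([10.0, 0.0, -2.0])

report = summarize_log_inputs(X_demo, y_demo)
print(report)

# Apply np.log to the toy target, silencing NumPy's runtime warnings for the demo
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log(y_demo)
print(logged)
```

If every count for y is zero but `X_nan` is not, the target transformation is probably not the source of the NaN, and imputing or dropping the missing feature values before calling RFECV, as the error message suggests, would be the next thing to try.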
