Sklearn preprocessors work when applied sequentially but produce NaN when used in a Pipeline


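Context:

I am working with a dataset that contains several feature types (numeric and categorical). My task is a binary prediction of startup success against a previously defined target variable. The ML pipeline for my HistGradientBoostingClassifier has several preprocessing steps: log transformation, square-root transformation, winsorization (at two levels for different variables), polynomial feature creation, sine transformation, and standard scaling (with a separate scaler for each group of transformed features), plus target encoding for the categorical features. I use a TimeSeriesSplit cross-validation strategy with GridSearchCV to tune the hyperparameters of a logistic regression. I have several preprocessors because I need to combine the preprocessing steps that produce new columns with the scaling of those columns.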

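Problem:

When I run log_grid.fit(X_train, y_train), the pipeline emits a warning that NaN values were encountered (but it does not fail). However, if I apply the preprocessor steps to X_train and y_train separately before feeding them into the pipeline, everything runs as expected (the dataset has 17,000 observations and 0 missing values). Here are my preprocessors: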

log_transformer = FunctionTransformer(np.log1p, validate=False)  
sqrt_transformer = FunctionTransformer(np.sqrt, validate=False)
winsorizer_low = FunctionTransformer(winsorizer_selfmade, kw_args={'limits': [0.01, 0.01]}, validate=False)
winsorizer_strong = FunctionTransformer(winsorizer_selfmade, kw_args={'limits': [0.05, 0.05]}, validate=False)
poly_transformer = PolynomialFeatures(degree=2)
sin_transformer = FunctionTransformer(np.sin, validate=False)

normal_scaler = StandardScaler()
scaler_log = StandardScaler()  # Scaler for log-transformed features
scaler_sqrt = StandardScaler()  # Scaler for sqrt-transformed features
scaler_winsor_low = StandardScaler()  # Scaler for winsorized (low) features
scaler_winsor_strong = StandardScaler()  # Scaler for winsorized (strong) features
scaler_poly = StandardScaler()  # Scaler for polynomial features

preprocessor_target = ColumnTransformer(
    transformers=[
        ('target', TargetEncoder(handle_unknown='ignore'), dummy_cols),  # Target-encode the categorical columns
        ('log', log_transformer, log_transformer_cols),
        ('sqrt', sqrt_transformer, sqrt_transformer_cols),
        ('winsor_low', winsorizer_low, low_winsor_cols),
        ('winsor_strong', winsorizer_strong, strong_winsor_cols),
        ('poly', poly_transformer, poly_cols),
        ('sin', sin_transformer, sin_transformer_cols),
        ('normal_scale', normal_scaler, normal_scale_cols)
    ],
    remainder='passthrough'  # Include columns that are not specified without any transformations
)

preprocessor_target.set_output(transform='pandas')


preprocessor_scaling = ColumnTransformer(
    transformers=[
        ('scale_log', scaler_log, log_transformer_cols_scaler),
        ('scale_sqrt', scaler_sqrt, sqrt_transformer_cols_scaler),
        ('scale_winsor_low', scaler_winsor_low, low_winsor_cols_scaler),
        ('scale_winsor_strong', scaler_winsor_strong, strong_winsor_cols_scaler),
        ('scale_poly', scaler_poly, poly_features_cols_scaler)
    ],
    remainder='passthrough'  # Include columns that are not specified without any transformations
)

preprocessor_scaling.set_output(transform='pandas')

variancer = VarianceThreshold(0.0001)

I then combine them into my pipeline (a logistic regression in this example, for simplicity):

log_pipe = Pipeline(steps=[
    ('preprocessor_target', preprocessor_target),
    ('preprocessor_scaling', preprocessor_scaling),
    ('zero_variance', variancer),
    ('logreg', LogisticRegression(penalty='l1', solver='liblinear')),
])

hyperparameters = {
    'logreg__C': np.logspace(-4, 4, 10),
}

I then run the grid search as follows:

log_grid = GridSearchCV(log_pipe, hyperparameters, cv=tscv, n_jobs=-1, verbose=1) # scoring='roc_auc'
log_grid.fit(X_train, y_train)

print('Best Hyperparameters:', log_grid.best_params_)
print('Best Cross-validation Score:', log_grid.best_score_)
print('Test Set Score:', log_grid.score(X_test, y_test))

y_pred = log_grid.predict(X_test)

# classification report
print(classification_report(y_test, y_pred))

This is the warning message I receive:

/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py:778: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/metrics/_scorer.py", line 444, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 722, in score
    return self.steps[-1][1].score(Xt, y, **score_params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 668, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
                             ^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 419, in predict
    scores = self.decision_function(X)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_base.py", line 400, in decision_function
    X = self._validate_data(X, accept_sparse="csr", reset=False)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/elias/anaconda3/lib/python3.11/site-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(
...
ValueError: Input X contains NaN.
LogisticRegression does not accept missing values encoded as NaN natively.

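What I have tried so far is to evaluate the preprocessors individually and in sequence:

Separate preprocessing:

I applied preprocessor_target.fit_transform(X_train, y_train) to transform the training features using the target variable. Likewise, I used preprocessor_scaling.fit_transform(preprocessed_data, y_train) (where preprocessed_data is the output of the previous step) to perform the scaling, and finally applied the zero-variance filter.

When I inspect the data produced by these three separate, sequential transformations, the output is clean (no missing values). But when the same steps are used inside the logistic-regression pipeline (log_pipe), I get the warning above.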
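The manual check described above runs on the full training set, whereas GridSearchCV refits the pipeline on each training fold and scores it on the matching validation fold; the traceback shows the error being raised during scoring. A minimal sketch of repeating the check per fold follows. The helper name nan_counts_per_fold and the toy data are hypothetical, and the sketch assumes that X and y are pandas objects.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def nan_counts_per_fold(preprocessing, X, y, cv):
    """Fit the preprocessing on each training fold, then count NaNs in the
    transformed training fold and the transformed validation fold."""
    rows = []
    for fold, (train_idx, test_idx) in enumerate(cv.split(X)):
        pre = clone(preprocessing)  # fresh, unfitted copy for every fold
        Xt_train = pre.fit_transform(X.iloc[train_idx], y.iloc[train_idx])
        Xt_test = pre.transform(X.iloc[test_idx])
        rows.append({
            "fold": fold,
            "train_nans": int(np.isnan(np.asarray(Xt_train, dtype=float)).sum()),
            "test_nans": int(np.isnan(np.asarray(Xt_test, dtype=float)).sum()),
        })
    return pd.DataFrame(rows)


# Toy stand-in for the real data and preprocessing (illustration only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.uniform(0, 10, 60), "b": rng.uniform(1, 5, 60)})
y_toy = pd.Series(rng.integers(0, 2, 60))
toy_preprocessing = Pipeline([
    ("sqrt", FunctionTransformer(np.sqrt)),
    ("scale", StandardScaler()),
])

report = nan_counts_per_fold(toy_preprocessing, X_toy, y_toy, TimeSeriesSplit(n_splits=3))
print(report)
```

With the objects from the question, the call would presumably be nan_counts_per_fold(log_pipe[:-1], X_train, y_train, tscv), where log_pipe[:-1] is the pipeline without its final estimator. A non-zero test_nans with a zero train_nans would point at a step that behaves differently on validation rows than on training rows.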

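Thanks for your help! If nothing comes of this, I will have to take the preprocessing steps out of the pipeline and apply them "manually". That would be acceptable because my data is static, but I would still like to understand this problem, and pipelines, more deeply.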

pandas scikit-learn pipeline missing-data sklearn-pandas
1 Answer

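If your data contains null values, handle them in a separate preprocessing step with ColumnTransformer or FunctionTransformer. Otherwise, the preprocessing steps themselves will produce null values. Check the output of each step for them.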
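One way a preprocessing step can create null values from complete inputs is index misalignment. This is only a possible cause, not a confirmed diagnosis: the sketch assumes that a custom function such as winsorizer_selfmade (not shown in the question) returns a pandas object with a fresh 0..n-1 index, and that the pandas output of a ColumnTransformer is assembled by concatenating each transformer's result column-wise. The two clip functions below are hypothetical stand-ins, and the example uses plain pandas to show the effect in isolation.

```python
import numpy as np
import pandas as pd

# Rows as a validation fold would see them: the index does not start at 0.
fold = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=[100, 101, 102],
)


def clip_resetting_index(df):
    # Hypothetical transformer body: rebuilding the frame from raw values
    # silently replaces the original index with 0..n-1.
    return pd.DataFrame(np.clip(df.to_numpy(), 1.5, 2.5), columns=df.columns)


def clip_keeping_index(df):
    # Same clipping, but the original index is preserved.
    return df.clip(lower=1.5, upper=2.5)


# Column-wise concatenation aligns rows on the index (outer join).
misaligned = pd.concat([fold[["b"]], clip_resetting_index(fold[["a"]])], axis=1)
aligned = pd.concat([fold[["b"]], clip_keeping_index(fold[["a"]])], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (6, 2) 6
print(aligned.shape, int(aligned.isna().sum().sum()))        # (3, 2) 0
```

If X_train has a default 0..n-1 index, a reset index happens to match it on the full training set and on TimeSeriesSplit training folds, which always start at row 0, but not on validation folds. That would be consistent with clean manual output and with the error appearing only during scoring, though other causes (for example np.sqrt applied to negative values) are equally worth checking.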
