Error when using a custom transformer in a scikit-learn Pipeline, but not as a standalone transformer


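As an exercise, I tried to create a custom transformer that takes a dataset and its labels and returns the transformed dataset, keeping only those columns whose correlation with the labels is above a given threshold. The transformer is given by the following code: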

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CorrelatedAttributesKeeper(BaseEstimator, TransformerMixin):
    def __init__(self, correlation_threshold = 0.5):
        self.correlation_threshold = correlation_threshold
        self.returned_indices = []    
    def fit(self, X, y=None):
        correlations = []
        for col in X:
            correlations.append(np.corrcoef(X[col].to_numpy(), y.to_numpy())[0,1])
        for idx, x in enumerate(correlations):
            if x > self.correlation_threshold:
                self.returned_indices.append(idx)
        return self
    def transform(self, X):
        return X.iloc[:, self.returned_indices]

The following seems to work as expected:

high_correl_transformer = CorrelatedAttributesKeeper(0.5)
transformed_housing_num = high_correl_transformer.fit_transform(housing_num, housing_labels) 

However, trying to run it as part of a pipeline raises an error:

num_pipeline2 = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ('correl_keeper', CorrelatedAttributesKeeper()),])
housing_num_tr2 = num_pipeline.fit_transform(housing_num, housing_labels)

This produces the following error:

Traceback (most recent call last):
  File "<string>", line 17, in __PYTHON_EL_eval
  File "<string>", line 3, in <module>
  File "/tmp/babel-0JUX9y/python-okiPo8", line 6, in <module>
    housing_num_tr2 = num_pipeline.fit_transform(housing_num, housing_labels)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/sklearn/pipeline.py", line 543, in fit_transform
    return last_step.fit_transform(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/sklearn/base.py", line 1101, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: CorrelatedAttributesKeeper.fit() takes 1 positional argument but 3 were given

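I am struggling to understand the stack trace. It says that fit takes 1 positional argument, but it looks like it should take 3, both in my definition and at the bottom of the stack trace: self, X, and y. What am I missing?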

Thanks

python scikit-learn pipeline
1 Answer

I don't know what is causing the specific TypeError you reported, but this answer addresses another problem with the class.
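
As a point of reference only: Python words the message this way when the method that actually gets called was defined with self alone, and self, X and y are passed to it. The sketch below uses a hypothetical Keeper class (not your class) to reproduce the wording. Whether something similar happened in your session, for instance an older definition of the class still attached to num_pipeline, is an assumption that the traceback alone does not confirm.

```python
class Keeper:
    # Hypothetical stand-in: fit is declared with `self` only.
    def fit(self):
        return self


try:
    # The call supplies self + X + y, i.e. 3 positional arguments.
    Keeper().fit([[1.0]], [1.0])
except TypeError as exc:
    message = str(exc)
    print(message)
```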

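The steps before the custom transformer return numpy arrays, which causes the custom transformer to fail because it expects a pandas DataFrame. I was going to suggest using StandardScaler().set_output(transform='pandas') and SimpleImputer().set_output(transform='pandas') to configure those steps to return DataFrames, but I think the other custom estimator would then fail as well, because it in turn expects numpy arrays (I referred to the GitHub example you are working from).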

You can change the custom transformer to work with numpy arrays instead. I have done that below, and it works when run in:

make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    CorrelatedAttributesKeeper()
).fit_transform(X, y)

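I have also made some other changes to bring the class more in line with what sklearn expects of an estimator. These include adding a trailing underscore to denote fitted attributes; defining self.n_features_in_; and not creating new variables in __init__ (which is limited to storing the supplied parameters).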
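
A short sketch of why these conventions matter, using a hypothetical ThresholdKeeper class that keeps columns by their mean rather than by correlation (the class, its kept_ attribute and the data are assumptions for illustration). Because __init__ only stores its parameter, clone can rebuild an unfitted copy from get_params, and because fitted state carries a trailing underscore, check_is_fitted can tell the two apart.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted


class ThresholdKeeper(BaseEstimator, TransformerMixin):
    """Hypothetical example: keep columns whose mean exceeds a threshold."""

    def __init__(self, threshold=0.5):
        # Only store the supplied parameter; no fitted state is created here.
        self.threshold = threshold

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.n_features_in_ = X.shape[1]
        # Fitted state is recomputed on every fit and ends with an underscore.
        self.kept_ = np.flatnonzero(X.mean(axis=0) > self.threshold)
        return self

    def transform(self, X):
        # Raises NotFittedError if fit has not been called yet.
        check_is_fitted(self, "kept_")
        return np.asarray(X)[:, self.kept_]


X_demo = np.array([[0.0, 1.0], [0.2, 1.0]])
keeper = ThresholdKeeper(threshold=0.9).fit(X_demo)

# clone copies the constructor parameters but none of the fitted attributes.
fresh = clone(keeper)
print(fresh.get_params(), hasattr(fresh, "kept_"))
```

In the original class, the list created in __init__ is appended to on every call to fit, so fitting the same instance twice would accumulate indices; resetting the fitted attribute inside fit avoids that.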

The modified class:

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array, check_X_y

class CorrelatedAttributesKeeper(BaseEstimator, TransformerMixin):
    def __init__(self, correlation_threshold=0.5):
        self.correlation_threshold = correlation_threshold
        
    def fit(self, X, y=None):
        
        # Validate X and y, converting them to numpy arrays if necessary
        X, y = check_X_y(X, y)
            
        self.n_features_in_ = X.shape[1]
        self.returned_indices_ = []
        
        correlations = []
        
        for col_idx in range(self.n_features_in_):
            correlations.append(np.corrcoef(X[:, col_idx], y)[0,1])
        
        for idx, x in enumerate(correlations):
            if x > self.correlation_threshold:
                self.returned_indices_.append(idx)
        
        return self
    
    def transform(self, X):
        X = check_array(X)
        return X[:, self.returned_indices_]
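
The per-column loop in fit could also be written in vectorised form. The following is a sketch rather than part of the class above: column_correlations is a hypothetical helper that computes the Pearson correlation of every column of X with y in one pass, and it assumes X is a 2-D numeric array and y a 1-D array of the same length.

```python
import numpy as np


def column_correlations(X, y):
    """Pearson correlation of each column of X with y (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre the columns and the target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Covariance-like numerator and the product of the two norms.
    numerator = Xc.T @ yc
    denominator = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # A constant column gives 0/0 here, i.e. nan, as np.corrcoef also does.
    return numerator / denominator


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=50)

correlations = column_correlations(X_demo, y_demo)
kept = np.flatnonzero(correlations > 0.5)  # same rule as the class: strictly above the threshold
```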

Data and tests:

#
#Test data
#
np.random.seed(0)
X = pd.DataFrame({
    'f0': np.linspace(0, 1, 100),
    'f1': np.linspace(0, 1, 100) + np.random.uniform(-0.4, .4, size=100),
    'f2': np.random.normal(size=100),
    'f3': np.random.normal(size=100),
})
y = pd.Series(np.linspace(0, 1, 100))

# Estimator should return f0 and f1, but not f2 or f3
display(
    'Correlation values',
    pd.concat([X, y.to_frame()], axis=1).corr().iloc[:-1, -1]
)

#Test separately
#with dataframes
CorrelatedAttributesKeeper().fit_transform(X, y)#.returned_indices_
#with numpy arrays
CorrelatedAttributesKeeper().fit_transform(X.values, y.values)#.returned_indices_

#Test in pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    CorrelatedAttributesKeeper()
).fit_transform(X, y)