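As an exercise, I'm trying to create a custom transformer that takes a dataset and labels and returns the transformed dataset, keeping only the columns whose correlation with the labels is above a given threshold. The transformer is given by the following code: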
from sklearn.base import BaseEstimator, TransformerMixin
class CorrelatedAttributesKeeper(BaseEstimator, TransformerMixin):
    def __init__(self, correlation_threshold = 0.5):
        self.correlation_threshold = correlation_threshold
        self.returned_indices = []
    def fit(self, X, y=None):
        correlations = []
        for col in X:
            correlations.append(np.corrcoef(X[col].to_numpy(), y.to_numpy())[0,1])
        for idx, x in enumerate(correlations):
            if x > self.correlation_threshold:
                self.returned_indices.append(idx)
        return self
    def transform(self, X):
        return X.iloc[:, self.returned_indices]
The following seems to work as expected:
high_correl_transformer = CorrelatedAttributesKeeper(0.5)
transformed_housing_num = high_correl_transformer.fit_transform(housing_num, housing_labels)
However, trying to run it as part of a pipeline raises an error:
num_pipeline2 = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
    ('correl_keeper', CorrelatedAttributesKeeper()),
])
housing_num_tr2 = num_pipeline.fit_transform(housing_num, housing_labels)
This produces the following error:
Traceback (most recent call last):
File "<string>", line 17, in __PYTHON_EL_eval
File "<string>", line 3, in <module>
File "/tmp/babel-0JUX9y/python-okiPo8", line 6, in <module>
housing_num_tr2 = num_pipeline.fit_transform(housing_num, housing_labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/sklearn/base.py", line 1474, in wrapper
return fit_method(estimator, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/sklearn/pipeline.py", line 543, in fit_transform
return last_step.fit_transform(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/sklearn/utils/_set_output.py", line 295, in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/sklearn/base.py", line 1101, in fit_transform
return self.fit(X, y, **fit_params).transform(X)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: CorrelatedAttributesKeeper.fit() takes 1 positional argument but 3 were given
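I'm struggling to understand the stack trace. It says fit takes 1 positional argument, yet as far as I can tell it takes 3 positional arguments, both in my definition and at the bottom of the stack trace: self, X, and y. What am I missing?
Thanks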
I don't know what is causing the specific TypeError you report, but this answer addresses a different problem with the class.
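The steps before your custom transformer return numpy arrays, which makes the custom transformer fail because it expects a pandas DataFrame. I was going to suggest configuring those steps to return DataFrames with StandardScaler().set_output(transform='pandas') and SimpleImputer().set_output(transform='pandas'), but I think the other custom estimator in your pipeline would then fail as well, because it in turn expects numpy arrays (I looked at the GitHub example you are working from).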
You can instead change your custom transformer to work with numpy arrays. I've done that below, and it runs without error in the following:
make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    CorrelatedAttributesKeeper()
).fit_transform(X, y)
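I also made a few other changes to bring it more in line with sklearn's estimator conventions: adding a trailing underscore to mark fitted attributes; defining self.n_features_in_; and not creating new variables in __init__ (which should only store the parameters it is given).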
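One concrete reason for keeping fitted state out of __init__ can be sketched as follows. LeakyKeeper is a hypothetical stand-in that mirrors the original design (a list created in __init__ and appended to in fit), and the small X and y are made-up data. Fitting the same instance twice accumulates indices instead of replacing them:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class LeakyKeeper(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fitted state is created in __init__, not in fit."""

    def __init__(self, correlation_threshold=0.5):
        self.correlation_threshold = correlation_threshold
        self.returned_indices = []  # created once, never reset by fit

    def fit(self, X, y=None):
        for idx, col in enumerate(X):  # iterating a DataFrame yields column names
            r = np.corrcoef(X[col].to_numpy(), y.to_numpy())[0, 1]
            if r > self.correlation_threshold:
                self.returned_indices.append(idx)
        return self

    def transform(self, X):
        return X.iloc[:, self.returned_indices]


# f0 is perfectly correlated with y; f1 is negatively correlated with y.
X = pd.DataFrame({"f0": [0.0, 1.0, 2.0, 3.0], "f1": [1.0, -1.0, 1.0, -1.0]})
y = pd.Series([0.0, 1.0, 2.0, 3.0])

keeper = LeakyKeeper()
keeper.fit(X, y)
first = list(keeper.returned_indices)
keeper.fit(X, y)  # refit the same instance
second = list(keeper.returned_indices)

print(first)                      # [0]
print(second)                     # [0, 0]: the index was appended a second time
print(keeper.transform(X).shape)  # (4, 2): column f0 now appears twice
```

Resetting the list at the start of fit, as the modified class does with returned_indices_, avoids this.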
Modified class:
import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import check_array, check_X_y
class CorrelatedAttributesKeeper(BaseEstimator, TransformerMixin):
    def __init__(self, correlation_threshold=0.5):
        self.correlation_threshold = correlation_threshold
    def fit(self, X, y=None):
        #To np if necessary, and checks
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        self.returned_indices_ = []
        correlations = []
        for col_idx in range(self.n_features_in_):
            correlations.append(np.corrcoef(X[:, col_idx], y)[0,1])
        for idx, x in enumerate(correlations):
            if x > self.correlation_threshold:
                self.returned_indices_.append(idx)
        return self
    def transform(self, X):
        X = check_array(X)
        return X[:, self.returned_indices_]
Data and tests:
#
#Test data
#
np.random.seed(0)
X = pd.DataFrame({
    'f0': np.linspace(0, 1, 100),
    'f1': np.linspace(0, 1, 100) + np.random.uniform(-0.4, .4, size=100),
    'f2': np.random.normal(size=100),
    'f3': np.random.normal(size=100),
})
y = pd.Series(np.linspace(0, 1, 100))
# Estimator should return f0 and f1, but not f2 or f3
display(
    'Correlation values',
    pd.concat([X, y.to_frame()], axis=1).corr().iloc[:-1, -1]
)
#Test separately
#with dataframes
CorrelatedAttributesKeeper().fit_transform(X, y)#.returned_indices_
#with numpy arrays
CorrelatedAttributesKeeper().fit_transform(X.values, y.values)#.returned_indices_
#Test in pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
make_pipeline(
    SimpleImputer(),
    StandardScaler(),
    CorrelatedAttributesKeeper()
).fit_transform(X, y)
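As an optional cross-check on the per-column loop in fit, the same Pearson correlations can be computed in one vectorised step. This is only a sketch: column_correlations is a hypothetical helper (not part of sklearn), the random data is made up, and it assumes no column is constant (a constant column would give a zero denominator):

```python
import numpy as np


def column_correlations(X, y):
    """Pearson correlation of each column of X with y (assumes no constant columns)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # centre each column
    yc = y - y.mean()        # centre the target
    return (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + 0.1 * rng.normal(size=50)  # y is built from column 0 plus noise

# Loop version, as in the modified class, versus the vectorised version.
loop = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
vec = column_correlations(X, y)
print(np.allclose(loop, vec))  # True

# Same selection rule as the transformer: keep columns above the threshold.
kept = np.flatnonzero(vec > 0.5)
print(kept)
```

Note that, like the class above, this compares the signed correlation to the threshold, so strongly negatively correlated columns are not kept.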