I'm working with the Titanic dataset and trying to use sklearn's make_pipeline correctly, but I'm a bit confused about how to structure the pipeline properly. Here is the code:
def sum_relatives(X):
    X_copy = X.copy()
    X_copy['total_relatives'] = X_copy['SibSp'] + X_copy['Parch']
    return X_copy
class_order = [[1, 2, 3]]

ord_pipeline = make_pipeline(
    OrdinalEncoder(categories=class_order)
)
def age_transformer(X):
    X_copy = X.copy()
    for index, row in self.median_age_by_class.iterrows():
        class_value = row['Pclass']
        median_age = row['median_age']
        X_copy.loc[X_copy['Pclass'] == class_value, 'Age'] = X_copy.loc[X_copy['Pclass'] == class_value, 'Age'].fillna(median_age)
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 100]
    X_copy['age_interval'] = pd.cut(X_copy['Age'], bins=bins)
    return X_copy
def age_processor():
    return make_pipeline(
        FunctionTransformer(age_transformer),
    )

total_relatives_pipeline = make_pipeline(
    FunctionTransformer(sum_relatives)
)

cat_pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore")
)

num_pipeline = make_pipeline([
    StandardScaler()
])

preprocessing = ColumnTransformer([
    ("ord", ord_pipeline, ['Pclass']),
    ("age_processing", age_processor(), ['Pclass', 'Age']),
    ("total_relatives", total_relatives_pipeline, ['SibSp', 'Parch']),
    ("cat", cat_pipeline, ['Sex', 'Embarked', 'traveling_category', 'age_interval']),
    ("num", num_pipeline, ['Fare']),
])
Calling fit_transform on my data raises the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[43], line 1
----> 1 data_processed = preprocessing.fit_transform(titanic_data)
2 data_processed.shape
File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )
File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1145 estimator._validate_params()
1147 with config_context(
1148 skip_parameter_validation=(
1149 prefer_skip_nested_validation or global_skip_validation
1150 )
1151 ):
-> 1152 return fit_method(estimator, *args, **kwargs)
...
445 "transform, or can be 'drop' or 'passthrough' "
446 "specifiers. '%s' (type %s) doesn't." % (t, type(t))
447 )
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Pipeline(steps=[('list', [('scaler', StandardScaler())])])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't.
I know the transformers passed to a pipeline shouldn't be wrapped in a list, but I don't understand why this error is being raised. Any help?
Your pipeline implementation looks correct to me. I think the problem is the square brackets inside make_pipeline() for num_pipeline. Replace it with:
num_pipeline = make_pipeline(
    StandardScaler()
)
Alternatively, since this is just a single step, you can be a bit more concise and do:
num_processor = StandardScaler()

preprocessing = ColumnTransformer([
    ...
    ("num", num_processor, ['Fare']),
])
If you're fine with skipping the definition of num_processor, you can supply StandardScaler() directly:
preprocessing = ColumnTransformer([
    ...
    ("num", StandardScaler(), ['Fare']),
])
Or, to get both in one line, use the := (walrus) operator:
preprocessing = ColumnTransformer([
    ...
    ("num", num_processor := StandardScaler(), ['Fare']),
])
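For completeness, a small sketch (using a hypothetical Fare column, not the real data) showing that the inline and walrus forms behave identically:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Titanic 'Fare' column.
df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})

# Inline estimator.
ct_inline = ColumnTransformer([("num", StandardScaler(), ["Fare"])])

# Walrus form: same transformer, but the scaler is also bound to a name.
# Note that ColumnTransformer clones its estimators, so the fitted copy
# lives in ct_walrus.named_transformers_["num"], not in num_processor.
ct_walrus = ColumnTransformer([("num", num_processor := StandardScaler(), ["Fare"])])

a = ct_inline.fit_transform(df)
b = ct_walrus.fit_transform(df)
```

Both produce the same scaled output, so the choice between them is purely about whether you want a handle on the (unfitted) estimator afterwards.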