Don't know how to use sklearn's make_pipeline correctly


I'm working with the Titanic dataset and trying to use sklearn's make_pipeline correctly, but I'm a bit confused about how to build the pipeline properly. Here is the code:

def sum_relatives(X):
    X_copy = X.copy()
    X_copy['total_relatives'] = X_copy['SibSp'] + X_copy['Parch']
    return X_copy

class_order = [[1, 2, 3]]

ord_pipeline = make_pipeline(
    OrdinalEncoder(categories=class_order)    
    )

def age_transformer(X):
    X_copy = X.copy()
    for index, row in self.median_age_by_class.iterrows():
        class_value = row['Pclass']
        median_age = row['median_age']
        X_copy.loc[X_copy['Pclass'] == class_value, 'Age'] = X_copy.loc[X_copy['Pclass'] == class_value, 'Age'].fillna(median_age)
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 100]
    X_copy['age_interval'] = pd.cut(X_copy['Age'], bins=bins)
    return X_copy

def age_processor():
    return make_pipeline(
        FunctionTransformer(age_transformer),
)

total_relatives_pipeline = make_pipeline(
    FunctionTransformer(sum_relatives)
)

cat_pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore")
)

num_pipeline = make_pipeline([
        StandardScaler()
])

preprocessing = ColumnTransformer([
    ("ord", ord_pipeline, ['Pclass']),
    ("age_processing", age_processor(), ['Pclass', 'Age']),
    ("total_relatives", total_relatives_pipeline, ['SibSp', 'Parch']),
    ("cat", cat_pipeline, ['Sex', 'Embarked', 'traveling_category', 'age_interval']),
    ("num", num_pipeline, ['Fare']),
])

When I call "fit_transform" on my data, it gives me the following error:

Output exceeds the size limit. Open the full output data in a text editor
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[43], line 1
----> 1 data_processed = preprocessing.fit_transform(titanic_data)
      2 data_processed.shape

File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157     data_to_wrap = f(self, X, *args, **kwargs)
158     if isinstance(data_to_wrap, tuple):
159         # only wrap the first output for cross decomposition
160         return_tuple = (
161             _wrap_data_with_container(method, data_to_wrap[0], X, self),
162             *data_to_wrap[1:],
163         )

File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
   1145     estimator._validate_params()
   1147 with config_context(
   1148     skip_parameter_validation=(
   1149         prefer_skip_nested_validation or global_skip_validation
   1150     )
   1151 ):
-> 1152     return fit_method(estimator, *args, **kwargs)
...
445         "transform, or can be 'drop' or 'passthrough' "
446         "specifiers. '%s' (type %s) doesn't." % (t, type(t))
447     )

TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Pipeline(steps=[('list', [('scaler', StandardScaler())])])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't.

I know the transformers given to a pipeline should not be wrapped in a list, but I don't understand why this error is being raised. Any help?

python python-3.x scikit-learn pipeline
1 Answer

Your implementation of the pipelines looks correct to me. I think the problem is the square brackets inside the make_pipeline() call for num_pipeline. Replace it with:

num_pipeline = make_pipeline(
    StandardScaler()
)
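
As a side note on why the list breaks things (my reading of the error, not part of the original answer): make_pipeline expects each estimator as a separate positional argument and generates the step names automatically, while the list-of-(name, estimator)-tuples form belongs to the Pipeline constructor. Passing a list to make_pipeline turns the whole list into a single step named 'list', which is why the traceback shows Pipeline(steps=[('list', ...)]) and complains that the step does not implement fit and transform. A minimal sketch of the two equivalent spellings:

from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline: estimators as positional arguments, step names generated automatically
num_pipeline = make_pipeline(StandardScaler())

# Pipeline: explicit list of (name, estimator) tuples, equivalent to the line above
num_pipeline = Pipeline([("scaler", StandardScaler())])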

Alternatively, since this is only a single step, if you want to be a bit more concise you can do the following:

num_processor = StandardScaler()

preprocessing = ColumnTransformer([
    ...
    ("num", num_processor, ['Fare']),
])

If you're fine with skipping the definition of num_processor, you can supply StandardScaler() directly:

preprocessing = ColumnTransformer([
    ...
    ("num", StandardScaler(), ['Fare']),
])

Or, to do both in one line, use the := (walrus) operator:

preprocessing = ColumnTransformer([
    ...
    ("num", num_processor := StandardScaler(), ['Fare']),
])
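
One caveat worth adding (my note, not from the original answer): ColumnTransformer clones its transformers during fitting, so the num_processor bound by := stays unfitted. If you want the fitted scaler afterwards, look it up by its step name through named_transformers_. A minimal sketch, assuming titanic_data is the DataFrame from the question and the other transformers run on it without errors:

preprocessing.fit(titanic_data)
fitted_scaler = preprocessing.named_transformers_["num"]  # fitted clone of the StandardScaler
print(fitted_scaler.mean_, fitted_scaler.scale_)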