I'm working with the Titanic dataset and trying to use sklearn's make_pipeline correctly, but I'm a bit confused about how to structure the pipeline properly. Here is the code:
def sum_relatives(X):
    X_copy = X.copy()
    X_copy['total_relatives'] = X_copy['SibSp'] + X_copy['Parch']
    return X_copy
class_order = [[1, 2, 3]]

ord_pipeline = make_pipeline(
    OrdinalEncoder(categories=class_order)
)
def age_transformer(X):
    X_copy = X.copy()
    for index, row in self.median_age_by_class.iterrows():
        class_value = row['Pclass']
        median_age = row['median_age']
        X_copy.loc[X_copy['Pclass'] == class_value, 'Age'] = X_copy.loc[X_copy['Pclass'] == class_value, 'Age'].fillna(median_age)
    bins = [0, 10, 20, 30, 40, 50, 60, 70, 100]
    X_copy['age_interval'] = pd.cut(X_copy['Age'], bins=bins)
    return X_copy
def age_processor():
    return make_pipeline(
        FunctionTransformer(age_transformer),
    )

total_relatives_pipeline = make_pipeline(
    FunctionTransformer(sum_relatives)
)

cat_pipeline = make_pipeline(
    OneHotEncoder(handle_unknown="ignore")
)

num_pipeline = make_pipeline([
    StandardScaler()
])

preprocessing = ColumnTransformer([
    ("ord", ord_pipeline, ['Pclass']),
    ("age_processing", age_processor(), ['Pclass', 'Age']),
    ("total_relatives", total_relatives_pipeline, ['SibSp', 'Parch']),
    ("cat", cat_pipeline, ['Sex', 'Embarked', 'traveling_category', 'age_interval']),
    ("num", num_pipeline, ['Fare']),
])
Calling fit_transform on my data raises the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[43], line 1
----> 1 data_processed = preprocessing.fit_transform(titanic_data)
2 data_processed.shape
File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/utils/_set_output.py:157, in _wrap_method_output.<locals>.wrapped(self, X, *args, **kwargs)
155 @wraps(f)
156 def wrapped(self, X, *args, **kwargs):
--> 157 data_to_wrap = f(self, X, *args, **kwargs)
158 if isinstance(data_to_wrap, tuple):
159 # only wrap the first output for cross decomposition
160 return_tuple = (
161 _wrap_data_with_container(method, data_to_wrap[0], X, self),
162 *data_to_wrap[1:],
163 )
File ~/.virtualenvs/handson/lib/python3.10/site-packages/sklearn/base.py:1152, in _fit_context.<locals>.decorator.<locals>.wrapper(estimator, *args, **kwargs)
1145 estimator._validate_params()
1147 with config_context(
1148 skip_parameter_validation=(
1149 prefer_skip_nested_validation or global_skip_validation
1150 )
1151 ):
-> 1152 return fit_method(estimator, *args, **kwargs)
...
445 "transform, or can be 'drop' or 'passthrough' "
446 "specifiers. '%s' (type %s) doesn't." % (t, type(t))
447 )
TypeError: All estimators should implement fit and transform, or can be 'drop' or 'passthrough' specifiers. 'Pipeline(steps=[('list', [('scaler', StandardScaler())])])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't.
I know the transformers passed to a pipeline shouldn't be wrapped in a list, but I don't understand why this error is being raised. Any help?
Your pipeline implementation looks correct to me. I think the problem is the square brackets inside make_pipeline() for num_pipeline. Replace it with:
num_pipeline = make_pipeline(
    StandardScaler()
)
Alternatively, since this is just a single step, you can be a bit more concise and do:
num_processor = StandardScaler()

preprocessing = ColumnTransformer([
    ...
    ("num", num_processor, ['Fare']),
])
If you're fine with skipping the definition of num_processor, you can supply StandardScaler() directly:
preprocessing = ColumnTransformer([
    ...
    ("num", StandardScaler(), ['Fare']),
])
Or, to get both in one line, use the := (walrus) operator:
preprocessing = ColumnTransformer([
    ...
    ("num", num_processor := StandardScaler(), ['Fare']),
])
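For completeness, a small sketch (using a hypothetical Fare column, not the real data) showing that the inline and walrus forms behave identically:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the Titanic 'Fare' column.
df = pd.DataFrame({"Fare": [7.25, 71.28, 8.05]})

# Inline estimator.
ct_inline = ColumnTransformer([("num", StandardScaler(), ["Fare"])])

# Walrus form: same transformer, but the scaler is also bound to a name.
# Note that ColumnTransformer clones its estimators, so the fitted copy
# lives in ct_walrus.named_transformers_["num"], not in num_processor.
ct_walrus = ColumnTransformer([("num", num_processor := StandardScaler(), ["Fare"])])

a = ct_inline.fit_transform(df)
b = ct_walrus.fit_transform(df)
```

Both produce the same scaled output, so the choice between them is purely about whether you want a handle on the (unfitted) estimator afterwards.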