Specifying the columns to select in GridSearchCV's parameter grid

Question

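I want to use sklearn's GridSearchCV to train models where the set of features used is itself treated as a hyperparameter.

An example parameter grid looks like this: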

[
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    }
]

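This means I want GridSearchCV to train 4 logistic regressions (one per value of C) using the feature set

['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)']

and another 4 logistic regressions using the feature set

['COUNT(activities)']

The same goes for the dummy model (one fit per strategy, for each feature set).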
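To make the intended search space concrete, the grid can be expanded with sklearn's ParameterGrid, which is what GridSearchCV iterates over internally. This is only a sketch: the `feature_sets` variable is introduced here for brevity and is not part of the original code.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# Hypothetical helper variable: the two feature sets from the question.
feature_sets = [
    ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
    ['COUNT(activities)'],
]

p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
]

# Each element is one candidate configuration that would be cross-validated.
candidates = list(ParameterGrid(p_grid))
n_logreg = sum(isinstance(c['clf'], LogisticRegression) for c in candidates)
n_dummy = sum(isinstance(c['clf'], DummyClassifier) for c in candidates)

print(len(candidates), n_logreg, n_dummy)  # 12 8 4
```

So the grid describes 8 logistic regression candidates (4 values of C times 2 feature sets) and 4 dummy candidates (2 strategies times 2 feature sets).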

Here is what I have tried:

import pandas as pd
from typing import List, Dict
from functools import reduce
from utils import ClfSwitcher, update_pgrid

from optbinning import BinningProcess
from sklearn.model_selection import cross_validate, GridSearchCV, KFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.dummy import DummyClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression

# Feature selector transformer. Given a list of feature names, it outputs a dataframe with every column whose name contains one of the names passed in the 'feature_names' parameter.

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selected_features = [col for col in X.columns if any(name in col for name in self.feature_names)]
        return X[selected_features]


# nested cross validation setup

n_folds = 3
scoring = {'auc': 'roc_auc', 'log_loss': 'neg_log_loss', 'brier_score': 'neg_brier_score'}
p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    }
]


inner_cv = KFold(n_splits=n_folds, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=n_folds, shuffle=True, random_state=3)

# get the names of categorical and numerical features
num_vars = []
cat_vars = []
for v, t in zip(X.dtypes.index, X.dtypes):
    if ("int" in str(t)) or ("float" in str(t)):
        num_vars.append(v)
    else:
        cat_vars.append(v)

# initialize the transformers that will go in the ColumnTransformer

imp = SimpleImputer(strategy="median")
scl = StandardScaler()
ohe = OneHotEncoder(
    drop="first", handle_unknown="infrequent_if_exist", min_frequency=0.1
)

feature_selector = FeatureSelector(feature_names=['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'])

# build the ColumnTransformer

t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars ),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')


# create a pipeline
pipe = Pipeline([
    ('coltrans', col_transformer),
    ('clf', DummyClassifier()),
])

# run cross-validation

clf = GridSearchCV(estimator=pipe, param_grid=p_grid, cv=inner_cv, refit=True, error_score='raise')

cv_results = cross_validate(
    clf,
    X,
    y,
    cv=outer_cv,
    scoring=scoring,
    return_estimator=False,
)

auc = reduce(lambda x, y: x + y, cv_results["test_auc"]) / n_folds
log_loss = reduce(lambda x, y: x + y, cv_results["test_log_loss"]) / n_folds


print(
    " AUC estimate: ",
    auc,
    "\n",
    "Log loss estimate: ",
    log_loss
)

The thing is, if I modify my column transformer as follows:

t = [
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')

and then apply it to X:

col_transformer.fit_transform(X)

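I get an array with only two columns, and that works fine. The problem is that I have to put the feature_selector transformer inside the ColumnTransformer, because it needs the column names to work. I don't know how to select the features I want and then make sure they go through all the other transformations (imputation and one-hot encoding). The code I wrote runs, but after the column transformer is applied I get an array that contains all the initial numeric features plus all the dummy columns created by one-hot encoding.

I have already tried using mlxtend's feature_selection in the actual pipeline, but I don't know the indices of the features I want to select, because by then they have already gone through one-hot encoding (is there a way around this?).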

Answer 1

With your original approach:

t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars ),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')

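you end up including every (num + cat) feature as transformed by the first two transformers, followed by the one or two features you wanted, which come out of the last transformer untransformed. (See also "Consistent ColumnTransformer for intersecting lists of columns" and its linked questions.)

It seems you want to include only the subset of features, and to transform them accordingly. So you should put your selector in a pipeline before the rest of the transformations: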
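The concatenation behaviour can be reproduced on a toy frame. In this sketch the frame `X_toy` and its columns are made up, and the "passthrough" entry stands in for the custom FeatureSelector, since both simply hand back the columns they are given.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with three numeric columns.
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": [5.0, 5.0, 6.0],
})

ct = ColumnTransformer(
    [
        # Scales all three columns.
        ("scale", StandardScaler(), ["a", "b", "c"]),
        # Stand-in for the selector: returns column "a" unchanged.
        ("select", "passthrough", ["a"]),
    ],
    remainder="drop",
)

out = ct.fit_transform(X_toy)
print(out.shape)  # (3, 4): three scaled columns plus the raw copy of "a"
```

Each transformer works on the original input independently and the results are stacked side by side, so the selector never restricts what the other transformers see.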

processor = ColumnTransformer(t[:-1], remainder='drop')

pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])

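Since your feature selector produces a dataframe, you don't need to worry about the column transformer receiving feature names. What you don't know ahead of time is which subset of features will reach it. You can, however, use a callable in the column specification instead of a hard-coded list (and you have already written the logic!):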

def num_type_detector(X):
    num_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        if ("int" in str(t)) or ("float" in str(t)):
            num_vars.append(v)
    return num_vars

def cat_type_detector(X):
    cat_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        # everything that is not an int or float column counts as categorical
        if not (("int" in str(t)) or ("float" in str(t))):
            cat_vars.append(v)
    return cat_vars

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(imp, scl), num_type_detector),
        ("ohe", ohe, cat_type_detector),
    ],
    remainder='drop',
)

pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])
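Putting the pieces together, a grid search over feature sets might look like the sketch below. Several things here are assumptions rather than part of the original post: the toy data and its column names, the added 'clf' step, the compact `num_cols` / `cat_cols` helpers (equivalent in spirit to the detectors above), and the reduced grid. Note that once the selector is its own pipeline step named 'select', the grid key becomes `select__feature_names` instead of `coltrans__feature_selector__feature_names`.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Keep the columns whose name contains any of `feature_names`."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        keep = [c for c in X.columns if any(n in c for n in self.feature_names)]
        return X[keep]


def is_numeric(dtype):
    return "int" in str(dtype) or "float" in str(dtype)


def num_cols(X):
    # Evaluated at fit time on whatever columns the selector let through.
    return [c for c, t in X.dtypes.items() if is_numeric(t)]


def cat_cols(X):
    return [c for c, t in X.dtypes.items() if not is_numeric(t)]


# Hypothetical toy data standing in for the asker's X and y.
rng = np.random.default_rng(0)
n = 90
X = pd.DataFrame({
    "n_activities": rng.poisson(5, n).astype(float),
    "n_events": rng.poisson(3, n).astype(float),
    "device": rng.choice(["desktop", "mobile"], n),
})
y = (X["n_activities"] > 4).astype(int)

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_cols),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    remainder="drop",
)

pipe = Pipeline([
    ("select", FeatureSelector(feature_names=["n_activities"])),
    ("process", processor),
    ("clf", LogisticRegression()),
])

feature_sets = [["n_activities", "device"], ["n_activities"]]
p_grid = {
    "select__feature_names": feature_sets,
    "clf__C": [0.5, 0.1],
}

search = GridSearchCV(
    pipe, p_grid, cv=KFold(n_splits=3, shuffle=True, random_state=1), scoring="roc_auc"
)
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 4 candidates: 2 feature sets x 2 values of C
print(search.best_params_)
```

When a feature set contains no categorical column, `cat_cols` returns an empty list and ColumnTransformer simply skips the one-hot encoder for that candidate.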

You should consider a more elegant version of num_type_detector, for example using make_column_selector (docs).
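As a rough sketch of that suggestion, the two detectors could be replaced with make_column_selector instances. The toy frame is made up, and matching on the pandas "number" dtype is an assumption that is only approximately equivalent to the string checks on "int" and "float" used above.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Hypothetical toy frame with two numeric columns and one string column.
X_toy = pd.DataFrame({
    "n_activities": [3, 7, 1],
    "spend": [1.5, 0.0, 2.25],
    "device": ["desktop", "mobile", "desktop"],
})

# Callables that can be passed directly as the column specification
# of a ColumnTransformer entry.
num_type_detector = make_column_selector(dtype_include="number")
cat_type_detector = make_column_selector(dtype_exclude="number")

print(num_type_detector(X_toy))  # ['n_activities', 'spend']
print(cat_type_detector(X_toy))  # ['device']
```

Like the hand-written functions, these selectors are evaluated when the ColumnTransformer is fitted, so they adapt to whichever columns the upstream selector lets through. They only work on pandas dataframes.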

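If you were to use less-custom feature selectors, you could rely on the pandas-output functionality introduced in sklearn v1.2. It does not work with sparse arrays (yet), so you would need to set sparse=False in the one-hot encoder (the parameter is named sparse_output from v1.2 onward), and you might run into issues with mixed types.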
