Specifying the columns to select in GridSearchCV's parameter grid


I want to use sklearn's GridSearchCV to train models where the set of features is itself a hyperparameter.

An example parameter grid looks like this:

[
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    }
]

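This means I want GridSearchCV to train 4 logistic regressions (one per value of C) using the feature set

['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)']

and another 4 logistic regressions using the feature set

['COUNT(activities)']

The same applies to the dummy model.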
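As a sanity check on how many candidates this grid describes, the combinations can be enumerated with sklearn's ParameterGrid. This is only a minimal sketch: the name feature_sets is introduced here for brevity and is not part of my actual code.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# The two candidate feature sets, factored out for brevity (illustrative name).
feature_sets = [
    ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
    ['COUNT(activities)'],
]

p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
]

# Expand the grid into the individual parameter combinations.
candidates = list(ParameterGrid(p_grid))
n_logreg = sum(isinstance(c['clf'], LogisticRegression) for c in candidates)
n_dummy = sum(isinstance(c['clf'], DummyClassifier) for c in candidates)
print(len(candidates), n_logreg, n_dummy)  # 12 8 4
```

So the search should fit 8 logistic regressions (4 values of C times 2 feature sets) and 4 dummy models (2 strategies times 2 feature sets) per fold.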

Here is what I tried:

import pandas as pd
from typing import List, Dict
from functools import reduce
from utils import ClfSwitcher, update_pgrid

from optbinning import BinningProcess
from sklearn.model_selection import cross_validate, GridSearchCV, KFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.dummy import DummyClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression

# Feature selector transformer: given a list of feature names, it returns a DataFrame
# containing every column whose name contains one of the names in 'feature_names'.

class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selected_features = [col for col in X.columns if any(name in col for name in self.feature_names)]
        return X[selected_features]


# nested cross validation setup

n_folds = 3
scoring = {'auc': 'roc_auc', 'log_loss': 'neg_log_loss', 'brier_score': 'neg_brier_score'}
p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'], 
            ['COUNT(activities)']
        ]
    }
]


inner_cv = KFold(n_splits=n_folds, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=n_folds, shuffle=True, random_state=3)

# get the names of categorical and numerical features
num_vars = []
cat_vars = []
for v, t in zip(X.dtypes.index, X.dtypes):
    if ("int" in str(t)) or ("float" in str(t)):
        num_vars.append(v)
    else:
        cat_vars.append(v)

# initialize the transformers that will go in the ColumnTransformer

imp = SimpleImputer(strategy="median")
scl = StandardScaler()
ohe = OneHotEncoder(
    drop="first", handle_unknown="infrequent_if_exist", min_frequency=0.1
)

feature_selector = FeatureSelector(feature_names=['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'])

# build the ColumnTransformer

t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars ),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')


# create a pipeline
pipe = Pipeline([
    ('coltrans', col_transformer),
    ('clf', DummyClassifier()),
])

# run cross-validation

clf = GridSearchCV(estimator=pipe, param_grid=p_grid, cv=inner_cv, refit=True, error_score='raise')

cv_results = cross_validate(
    clf,
    X,
    y,
    cv=outer_cv,
    scoring=scoring,
    return_estimator=False,
)

auc = reduce(lambda x, y: x + y, cv_results["test_auc"]) / n_folds
log_loss = reduce(lambda x, y: x + y, cv_results["test_log_loss"]) / n_folds


print(
    " AUC estimate: ",
    auc,
    "\n",
    "Log loss estimate: ",
    log_loss
)

The thing is, if I modify my column transformer as follows:

t = [
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')

and then apply it to X:

col_transformer.fit_transform(X)

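I get an array with only two columns, and it works fine. The problem is that I have to put the feature_selector transformer inside the ColumnTransformer, because it needs the column names to work. I don't know how to select the features I want and then make sure they go through all the other transformations (imputation and one-hot encoding). The code I wrote runs, but after the column transformer I get an array that contains all the initial numeric features plus all the dummy columns created by the one-hot encoding.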

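I have also tried using mlxtend's feature_selection in the actual pipeline, but I don't know the indices of the features I want to select, because by then they have already been one-hot encoded (is there a way around this?).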

scikit-learn cross-validation gridsearchcv
1 Answer

With your original approach:

t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars ),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars+num_vars),
]
    
col_transformer = ColumnTransformer(transformers=t, remainder='drop')

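you end up with every (num + cat) feature as transformed by the first two transformers, followed by the one or two features you wanted to include, which the last transformer passes through untransformed. A ColumnTransformer applies each transformer to its own column list and concatenates the outputs side by side; it does not chain them. (See also "Consistent ColumnTransformer for intersecting lists of columns" and its linked questions.)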
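To see the concatenation concretely, here is a minimal sketch on toy data. The frame X_toy and its column names are made up for illustration, and sparse_threshold=0 is set only so the result prints as a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureSelector(TransformerMixin, BaseEstimator):
    """Same logic as the selector in the question: keep columns matching a name."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        keep = [c for c in X.columns if any(n in c for n in self.feature_names)]
        return X[keep]


# Toy data (hypothetical): two numeric columns and one categorical column.
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 20.0, 30.0, 40.0],
    "c": ["x", "y", "x", "y"],
})

ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), ["a", "b"]),
        ("ohe", OneHotEncoder(drop="first"), ["c"]),
        ("feature_selector", FeatureSelector(["a"]), ["a", "b", "c"]),
    ],
    remainder="drop",
    sparse_threshold=0,  # always return a dense array
)

out = ct.fit_transform(X_toy)
print(out.shape)  # (4, 4): 2 scaled + 1 one-hot + 1 selected column
# The last column is the raw, unscaled "a" appended by the selector.
print(np.allclose(out[:, -1], X_toy["a"]))  # True
```

The selector did not restrict what the other two transformers saw; it only appended one more untransformed column.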

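It seems you want to include only a subset of the features and transform just those. So you should put the selector in a pipeline before the rest of the transformations: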

processor = ColumnTransformer(t[:-1], remainder='drop')

pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])

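With this layout, the grid key for the feature set becomes select__feature_names. Since your feature selector outputs a dataframe, you don't have to worry about the column transformer receiving feature names, but you don't know ahead of time which subset of features will reach it. You can, however, use a callable in the columns specification instead of a hard-coded list (and you have already written the logic for it!):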

def num_type_detector(X):
    num_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        if ("int" in str(t)) or ("float" in str(t)):
            num_vars.append(v)
    return num_vars

def cat_type_detector(X):
    cat_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        # everything that is not int/float is treated as categorical
        if not (("int" in str(t)) or ("float" in str(t))):
            cat_vars.append(v)
    return cat_vars

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(imp, scl), num_type_detector),
        ("ohe", ohe, cat_type_detector),
    ],
    remainder='drop',
)

pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])

You should consider a more elegant version of num_type_detector, e.g. using make_column_selector (see the scikit-learn docs).
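Putting the pieces together, the following is a sketch of the whole search with make_column_selector in place of the hand-written detectors. It runs on synthetic stand-in data: the column names num_a, num_b and cat_c, the target y, and the reduced grid are assumptions for illustration, not your actual features.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        keep = [c for c in X.columns if any(n in c for n in self.feature_names)]
        return X[keep]


# Synthetic stand-in data (hypothetical column names).
rng = np.random.default_rng(0)
n = 120
X = pd.DataFrame({
    "num_a": rng.normal(size=n),
    "num_b": rng.normal(size=n),
    "cat_c": rng.choice(["x", "y", "z"], size=n),
})
y = (X["num_a"] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# The column selectors are evaluated at fit time on whatever the selector
# step lets through, so no hard-coded column lists are needed.
processor = ColumnTransformer(
    [
        ("imp_scale",
         make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
         make_column_selector(dtype_include=np.number)),
        ("ohe",
         OneHotEncoder(handle_unknown="ignore"),
         make_column_selector(dtype_exclude=np.number)),
    ],
    remainder="drop",
)

pipe = Pipeline([
    ("select", FeatureSelector(feature_names=["num_a"])),
    ("process", processor),
    ("clf", DummyClassifier()),
])

feature_sets = [["num_a", "cat_c"], ["num_a"]]
p_grid = [
    {"clf": [LogisticRegression()], "clf__C": [0.5, 0.1],
     "select__feature_names": feature_sets},
    {"clf": [DummyClassifier()], "clf__strategy": ["prior", "most_frequent"],
     "select__feature_names": feature_sets},
]

inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(pipe, p_grid, cv=inner_cv, scoring="roc_auc")
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 8 candidates: 2*2 + 2*2
print(search.best_params_["select__feature_names"])
```

The search object can then be passed to cross_validate for the outer loop, exactly as in your original code.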


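If you were to use a less custom feature selector, you could use the pandas-out feature included in sklearn v1.2. It does not work with sparse arrays (yet), so you would need to set sparse=False in the one-hot encoder (the parameter is named sparse_output from v1.2 onward, where sparse was deprecated), and you may run into issues with mixed types.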
