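I want to use sklearn's GridSearchCV to train models where the set of features is itself one of the hyperparameters.
An example parameter grid looks like this: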
[
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
            ['COUNT(activities)']
        ]
    }
]
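This means I want GridSearchCV to train 4 logistic regressions (one per value of C) using the feature set
['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)']
and another 4 logistic regressions using the feature set ['COUNT(activities)'].
The same goes for the dummy models.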
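As a quick sanity check on what that grid expands to, here is a small sketch that enumerates the candidates with sklearn's ParameterGrid. The helper names (feature_sets, candidates, n_logreg, n_dummy) are mine and not part of the code in this post.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterGrid

# The two feature sets that are treated as a hyperparameter.
feature_sets = [
    ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
    ['COUNT(activities)'],
]

p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': feature_sets,
    },
]

# ParameterGrid expands each dict into the cross product of its values.
candidates = list(ParameterGrid(p_grid))
n_logreg = sum(isinstance(c['clf'], LogisticRegression) for c in candidates)
n_dummy = sum(isinstance(c['clf'], DummyClassifier) for c in candidates)
print(len(candidates), n_logreg, n_dummy)
```

So the search covers 4 x 2 logistic regression candidates plus 2 x 2 dummy candidates.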
This is what I have tried:
import pandas as pd
from typing import List, Dict
from functools import reduce
from utils import ClfSwitcher, update_pgrid
from optbinning import BinningProcess
from sklearn.model_selection import cross_validate, GridSearchCV, KFold
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.dummy import DummyClassifier
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression

# Feature selector transformer. Given a list of feature names, it outputs a DataFrame
# with every column whose name contains one of the names passed in 'feature_names'.
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        selected_features = [col for col in X.columns if any(name in col for name in self.feature_names)]
        return X[selected_features]

# nested cross validation setup
n_folds = 3
scoring = {'auc': 'roc_auc', 'log_loss': 'neg_log_loss', 'brier_score': 'neg_brier_score'}
p_grid = [
    {
        'clf': [LogisticRegression()],
        'clf__C': [0.5, 0.1, 0.05, 0.01],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
            ['COUNT(activities)']
        ]
    },
    {
        'clf': [DummyClassifier()],
        'clf__strategy': ['prior', 'most_frequent'],
        'coltrans__feature_selector__feature_names': [
            ['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'],
            ['COUNT(activities)']
        ]
    }
]
inner_cv = KFold(n_splits=n_folds, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=n_folds, shuffle=True, random_state=3)
# get the names of categorical and numerical features
num_vars = []
cat_vars = []
for v, t in zip(X.dtypes.index, X.dtypes):
    if ("int" in str(t)) or ("float" in str(t)):
        num_vars.append(v)
    else:
        cat_vars.append(v)
# initialize the transformers that will go in the ColumnTransformer
imp = SimpleImputer(strategy="median")
scl = StandardScaler()
ohe = OneHotEncoder(
    drop="first", handle_unknown="infrequent_if_exist", min_frequency=0.1
)
feature_selector = FeatureSelector(feature_names=['COUNT(activities)', 'COUNT(events WHERE device_category = desktop)'])
# build the ColumnTransformer
t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars + num_vars),
]
col_transformer = ColumnTransformer(transformers=t, remainder='drop')
# create a pipeline
pipe = Pipeline([
    ('coltrans', col_transformer),
    ('clf', DummyClassifier()),
])
# run cross-validation
clf = GridSearchCV(estimator=pipe, param_grid=p_grid, cv=inner_cv, refit=True, error_score='raise')
cv_results = cross_validate(
    clf,
    X,
    y,
    cv=outer_cv,
    scoring=scoring,
    return_estimator=False,
)
auc = reduce(lambda x, y: x + y, cv_results["test_auc"]) / n_folds
log_loss = reduce(lambda x, y: x + y, cv_results["test_log_loss"]) / n_folds
print(
    " AUC estimate: ",
    auc,
    "\n",
    "Log loss estimate: ",
    log_loss
)
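Here's the thing: if I modify my column transformer like this:
t = [
    ('feature_selector', feature_selector, cat_vars + num_vars),
]
col_transformer = ColumnTransformer(transformers=t, remainder='drop')
and then apply it to X:
col_transformer.fit_transform(X)
I get an array with only two columns, so that part works fine. The problem is that I have to put the feature_selector transformer inside the ColumnTransformer, because it needs the column names to work. I don't know how to select the features I want and then make sure they go through all the other transformations (imputation and one-hot encoding). The code I wrote runs, but after the column transformer I get an array that contains all of the initial numeric features plus all of the dummy columns created by one-hot encoding.
I have already tried using mlxtend's feature_selection in the actual pipeline, but I don't know the indices of the features I want to select, because by then they have already been one-hot encoded (is there a way around this?).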
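With your original approach:
t = [
    ("imp_scale", make_pipeline(imp, scl), num_vars),
    ("ohe", ohe, cat_vars),
    ('feature_selector', feature_selector, cat_vars + num_vars),
]
col_transformer = ColumnTransformer(transformers=t, remainder='drop')
you end up including every (num + cat) feature, transformed by the first two transformers, followed by the one or two features you wanted to include, which come out of the last transformer untransformed. A ColumnTransformer applies its transformers side by side on the input and concatenates their outputs; it does not chain them. (See also "Consistent ColumnTransformer for intersecting lists of columns" and its linked questions.)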
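To make that concrete, here is a minimal sketch on a made-up three-column frame (X_demo and the column other_num are hypothetical, and I left the one-hot branch out to keep it short). The scaled block and the selector block are simply stacked next to each other:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSelector(TransformerMixin, BaseEstimator):
    """Same selector as in the question (mixin listed first, as sklearn recommends)."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        keep = [c for c in X.columns if any(n in c for n in self.feature_names)]
        return X[keep]


# Hypothetical toy data, numeric columns only.
X_demo = pd.DataFrame({
    "COUNT(activities)": [1.0, 2.0, 3.0, 4.0],
    "COUNT(events WHERE device_category = desktop)": [0.0, 1.0, 0.0, 2.0],
    "other_num": [10.0, 20.0, 30.0, 40.0],
})
num_vars = list(X_demo.columns)

ct = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_vars),
        ("feature_selector", FeatureSelector(["COUNT(activities)"]), num_vars),
    ],
    remainder="drop",
)
out = ct.fit_transform(X_demo)

# 3 scaled columns followed by 1 selected column that was never scaled.
print(out.shape)
print(out[:, -1])
```

The last column is the raw COUNT(activities) values, which shows that the selector's output bypasses the imputer and scaler.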
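It seems you want to include only the chosen subset of features and transform those accordingly. So you should put the selector in a pipeline before the rest of the transformations:
processor = ColumnTransformer(t[:-1], remainder='drop')
pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])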
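Since your feature selector produces a DataFrame, you don't need to worry about the column transformer receiving feature names, but you won't know ahead of time which subset of features will reach it. However, you can use a callable in the column specification instead of a hard-coded list (and you've already got it!):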
def num_type_detector(X):
    # columns whose dtype name contains "int" or "float"
    num_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        if ("int" in str(t)) or ("float" in str(t)):
            num_vars.append(v)
    return num_vars

def cat_type_detector(X):
    # everything else is treated as categorical (note the negated condition)
    cat_vars = []
    for v, t in zip(X.dtypes.index, X.dtypes):
        if not (("int" in str(t)) or ("float" in str(t))):
            cat_vars.append(v)
    return cat_vars

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(imp, scl), num_type_detector),
        ("ohe", ohe, cat_type_detector),
    ],
    remainder='drop',
)
pipe = Pipeline([
    ('select', feature_selector),
    ('process', processor),
])
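For completeness, here is a rough end-to-end sketch of how this could plug into the grid search. It makes a few assumptions that are not in the question: the data is synthetic, the country column is invented so that the one-hot branch gets exercised, and the encoder is simplified to handle_unknown="ignore". Note that with the selector as its own pipeline step, the grid key becomes select__feature_names instead of coltrans__feature_selector__feature_names.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class FeatureSelector(TransformerMixin, BaseEstimator):
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        keep = [c for c in X.columns if any(n in c for n in self.feature_names)]
        return X[keep]


def _is_numeric(dtype):
    return "int" in str(dtype) or "float" in str(dtype)

def num_type_detector(X):
    return [c for c, t in X.dtypes.items() if _is_numeric(t)]

def cat_type_detector(X):
    return [c for c, t in X.dtypes.items() if not _is_numeric(t)]


# Synthetic stand-in data; 'country' is a hypothetical categorical column.
rng = np.random.default_rng(0)
n = 90
X = pd.DataFrame({
    "COUNT(activities)": rng.poisson(5, n).astype(float),
    "COUNT(events WHERE device_category = desktop)": rng.poisson(2, n).astype(float),
    "country": rng.choice(["DE", "FR", "US"], n),
})
y = (X["COUNT(activities)"] + rng.normal(0, 2, n) > 5).astype(int)

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_type_detector),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_type_detector),
    ],
    remainder="drop",
)
pipe = Pipeline([
    ("select", FeatureSelector(feature_names=["COUNT(activities)"])),
    ("process", processor),
    ("clf", DummyClassifier()),
])

feature_sets = [["COUNT(activities)", "country"], ["COUNT(activities)"]]
p_grid = [
    {"clf": [LogisticRegression()], "clf__C": [0.5, 0.1, 0.05, 0.01],
     "select__feature_names": feature_sets},
    {"clf": [DummyClassifier()], "clf__strategy": ["prior", "most_frequent"],
     "select__feature_names": feature_sets},
]

search = GridSearchCV(pipe, param_grid=p_grid, cv=3, error_score="raise")
search.fit(X, y)
print(search.best_params_)
```

When a feature set contains no categorical columns, cat_type_detector returns an empty list and the ColumnTransformer skips the one-hot branch for that candidate, so the same pipeline should handle both feature sets.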
You should consider a more elegant version of num_type_detector, e.g. using make_column_selector (docs).
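Something along these lines, as a sketch. Selecting on np.number is my assumption about what counts as numeric here; it is close to, but not exactly the same as, the "int"/"float" string check above. The toy frame df is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# dtype-based selectors, evaluated on whatever frame reaches the ColumnTransformer
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

processor = ColumnTransformer(
    [
        ("imp_scale", make_pipeline(SimpleImputer(strategy="median"), StandardScaler()), num_selector),
        ("ohe", OneHotEncoder(handle_unknown="ignore"), cat_selector),
    ],
    remainder="drop",
)

# Hypothetical toy frame to show which columns each selector picks.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [4, 3, 2, 1],
    "c": ["x", "y", "x", "y"],
})
print(num_selector(df), cat_selector(df))
out = processor.fit_transform(df)
print(out.shape)
```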
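If you were to use a less custom feature selector, you could use the pandas-out functionality introduced in sklearn v1.2. That doesn't work for sparse arrays (yet), so you'll need to set sparse_output=False in your one-hot encoder (the parameter was called sparse before v1.2), and you might run into issues with mixed types.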