Imputing missing values inside a nested GridSearchCV pipeline to avoid data leakage


I have some questions about how sklearn imputes values within an established CV and pipeline framework. The whole point is to avoid global imputation, which would distort model performance through data leakage. After looking around at a few links and guides, I mixed and matched my way to the snippet below. I intend to use it with several linear models, but this example sticks with Lasso.

My dataset consists of 95 numerical parameters and 5 categorical ones. Across the 100 observations in total, NaNs are scattered throughout (8-27% per column). There are no NaNs in my response, y.

Here I attempt KNN imputation for the numerical variables, scaling the data afterwards, and for the categorical variables most-frequent imputation and one-hot encoding, respectively.
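
Since I cannot share the CSV, here is a synthetic stand-in with the same structure that can replace the read_csv / y-extraction lines below; the column names, category labels and exact NaN fractions are made up, only the shape and the per-column missingness matter:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_obs = 100
num_cols = [f'num_{i}' for i in range(95)]
cat_cols = [f'cat_{i}' for i in range(5)]

df = pd.DataFrame(rng.normal(size=(n_obs, 95)), columns=num_cols)
for c in cat_cols:
    df[c] = rng.choice(['a', 'b', 'c'], size=n_obs)
# Knock out 8-27% of each column at random, mimicking the real data
for c in df.columns:
    frac = rng.uniform(0.08, 0.27)
    df.loc[rng.random(n_obs) < frac, c] = np.nan
y = pd.Series(rng.normal(size=n_obs), name='y')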

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
# Other candidate linear models; this example sticks with Lasso
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LassoLars

# Data prep
"""
We have 100 parameters; 95 are numerical and 5 are categorical.
Each row contains numerous missing values, which we need to impute *inside* the CV-loop
"""
df = pd.read_csv(dataPath, delimiter=',', skipinitialspace = True)
y = df.loc[:, 'y']
df = df.drop(['y'], axis=1)
df_num = df[df.columns[:95]]
df_cat = df[df.columns[95:]]

num_na = df_num.columns.to_list()
cat_na = df_cat.columns.to_list()


# KNN imputation, then scaling, for the numerical columns
numeric_transformer = Pipeline(steps=[
    ('imputer', KNNImputer(n_neighbors=2, weights="uniform")),
    ('scaler', StandardScaler())])
# Most-frequent imputation and one-hot encoding for the categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
# Do it for the established columns in df
preprocessor = ColumnTransformer(
   remainder = 'passthrough',
   transformers=[
       ('numeric', numeric_transformer, num_na),
       ('categorical', categorical_transformer, cat_na)
])

# Nested CV can be run on the configured GridSearchCV directly; the refit
# best-performing model from the inner search is then evaluated on the outer test folds
cv_inner = KFold(n_splits=3, shuffle=True, random_state=42) # Parameter validation
cv_outer = KFold(n_splits=5, shuffle=True, random_state=42) # Model validation
# define the model
pipe = Pipeline(steps=[('preprocessing', preprocessor),
                       ('clf', Lasso())]
)
# Gridsearch over parameter grid
alpha_grid = np.logspace(-4, 3, 100)
param_grid = [{'clf__alpha': alpha_grid}]
grid = GridSearchCV(pipe, cv=cv_inner, param_grid=param_grid, verbose=1, 
                return_train_score=True, scoring='neg_root_mean_squared_error',
                refit=True, n_jobs=-1)



scores = cross_val_score(grid, df, y, scoring='neg_root_mean_squared_error',
                        cv=cv_outer, n_jobs=-1)

print('Avg. RMSE across outer CV: %.3f ' % (np.mean(-scores)))
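
For completeness, a variant built on cross_validate with return_estimator=True keeps the fitted inner searches around, so the alpha chosen in each outer fold can be inspected alongside the outer score (a small sketch mirroring the cross_val_score call above):

from sklearn.model_selection import cross_validate

res = cross_validate(grid, df, y, scoring='neg_root_mean_squared_error',
                     cv=cv_outer, return_estimator=True, n_jobs=-1)
for i, fitted_grid in enumerate(res['estimator']):
    # best_params_ / best_score_ come from the inner search, test_score from the outer fold
    print(f"Outer fold {i}: alpha={fitted_grid.best_params_['clf__alpha']:.4g}, "
          f"inner RMSE={-fitted_grid.best_score_:.3f}, outer RMSE={-res['test_score'][i]:.3f}")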

When I run this framework, I get:

        RuntimeWarning: invalid value encountered in divide
  * (last_sum / last_over_new_count - new_sum) ** 2

which hints at a division by something that has become NaN. Are there any sharp scikit-learn minds out there who could sanity-check me on this?
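
One way to narrow down where the division by NaN happens is to rerun the preprocessing alone on a single outer-fold training split with numpy set to raise on invalid operations (a sketch; it assumes the warning originates in the imputer/scaler rather than in the Lasso fit, which is exactly what I would like to confirm):

import numpy as np

train_idx, test_idx = next(cv_outer.split(df))   # first outer fold, chosen arbitrarily
with np.errstate(invalid='raise', divide='raise'):
    # If this raises a FloatingPointError, the NaN division occurs during preprocessing
    preprocessor.fit_transform(df.iloc[train_idx])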

Thanks in advance.

python machine-learning scikit-learn cross-validation gridsearchcv