Hyperparameter tuning and model evaluation in scikit-learn

Question

I'm new to machine learning and a bit confused about how to do hyperparameter tuning and model evaluation properly. Should hyperparameter tuning be performed on the whole dataset or only on the training set? What is the correct order of operations? Could you look over my code and suggest best practices for this problem? Here I first run hyperparameter tuning on the entire dataset, and then evaluate the model's performance by fitting it on only the training set. Is this correct? Doesn't it cause data leakage?

Hyperparameter tuning:

numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.arange(0, 1.1, 0.1),
                     random_state=818,
                     n_jobs=-1)

model = make_pipeline(preprocessor, en_cv)
model.fit(X, y)

best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_

Model evaluation:

elastic_net = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)

elastic_net.fit(X_train, y_train)
y_pred = elastic_net.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(r2, mse)

Thanks in advance, and have a nice day!

For what it's worth, this code takes about 18 minutes to run on a dataset with roughly 80,000 observations and about 150 columns. Is that acceptable?

machine-learning scikit-learn evaluation hyperparameters
1 Answer

Welcome to the world of machine learning!

Regarding your first question: hyperparameter tuning should always be done on the training set, never on the whole dataset. Tuning on the entire dataset introduces what is called "data leakage", where information from the test set (which should remain unseen) influences the model training process. If you do that, your performance estimate will be leaky and overly optimistic. Regarding your second question, a baseline workflow looks like this:

  1. Split the dataset into a training set and a test set.
  2. Perform hyperparameter tuning only on the training set.
  3. Once the best hyperparameters are found, retrain the model on the training set using those hyperparameters.
  4. Evaluate the model on the test set to get an unbiased estimate of its generalization performance.

Based on the above, the code you posted would become:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error

# Split the data into training and testing sets first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)

numeric_features = X_train.select_dtypes(include=['int', 'float']).columns
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# Perform hyperparameter tuning on the training set
en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.arange(0, 1.1, 0.1),
                     random_state=818,
                     n_jobs=-1)

model = make_pipeline(preprocessor, en_cv)
model.fit(X_train, y_train)  # Use only training data here

# Get the best hyperparameters
best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_

# Train a new model on the training data using the best hyperparameters
elastic_net_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))
elastic_net_model.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = elastic_net_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2: {r2}, MSE: {mse}")
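One subtle point: in the pipeline above, ElasticNetCV only cross-validates the regressor, while the ColumnTransformer is fit once on the whole training set rather than inside each fold. If you want the preprocessing refit per fold as well, one option is to wrap the entire pipeline in GridSearchCV. The sketch below is only illustrative, not part of the original answer; it assumes X is a pandas DataFrame and y the matching target, swaps ElasticNetCV for a GridSearchCV over ElasticNet, and drops alpha=0, which ElasticNet does not handle well.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hold out a test set first; the search only ever sees the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)

numeric_features = X_train.select_dtypes(include=['int', 'float']).columns
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

# One pipeline: preprocessing and model are refit together inside every CV fold
pipe = make_pipeline(preprocessor, ElasticNet(max_iter=10000, random_state=818))

# Parameter names follow make_pipeline's default step name 'elasticnet'
param_grid = {
    'elasticnet__alpha': np.arange(0.1, 1.1, 0.1),
    'elasticnet__l1_ratio': np.arange(0.1, 1.1, 0.1),
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2', n_jobs=-1)
search.fit(X_train, y_train)            # tuning uses only the training set

print(search.best_params_)
print(search.score(X_test, y_test))     # R^2 on the held-out test set

Note that this grid evaluates 100 candidates times 5 folds, so it will typically be slower than ElasticNetCV, which reuses a regularization path; both approaches are fine as long as the test set stays outside the search.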
