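I'm new to machine learning and a bit confused about how to use hyperparameter tuning and model evaluation correctly. Should hyperparameter tuning be done on the whole dataset or only on the training set? What is the correct order of operations? Could you check my code and suggest best practices for this problem? Here I first run hyperparameter tuning on the whole dataset and then evaluate model performance only on the training set. Is that right? Won't it cause data leakage?
Hyperparameter tuning: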
numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.arange(0, 1.1, 0.1),
                     random_state=818,
                     n_jobs=-1)
model = make_pipeline(preprocessor, en_cv)
model.fit(X, y)
best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_
Model evaluation:
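# Use a distinct name so the pipeline does not shadow the ElasticNet class,
# and pass the tuned best_l1_ratio (the bare name l1_ratio is undefined)
final_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)
final_model.fit(X_train, y_train)
# Predict with the pipeline fitted on the training split, not the one fitted on all of X
y_pred = final_model.predict(X_test)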
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(r2, mse)
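Thanks in advance, and have a nice day!
For reference, this code takes about 18 minutes to run on a dataset with roughly 80,000 observations and 150 columns. Is that reasonable?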
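Welcome to the world of machine learning!
Regarding hyperparameter tuning: it should *always* be done on the training set, never on the whole dataset. Tuning on the whole dataset introduces what we call "data leakage": information from the test set, which is supposed to stay unseen, influences the model training process. If you do that, you get a leaky, overly optimistic performance estimate. Regarding the second question, a baseline pipeline would look like this:
Split the dataset into a training set and a test set.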
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error
# Split the data into training and testing sets first
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)
numeric_features = X_train.select_dtypes(include=['int', 'float']).columns
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)
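# Perform hyperparameter tuning on the training set
# The alpha grid starts at 0.1 rather than 0: alpha=0 is plain least squares,
# and scikit-learn advises against fitting it with ElasticNet
en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.linspace(0.1, 1.0, 10),
                     random_state=818,
                     n_jobs=-1)
model = make_pipeline(preprocessor, en_cv)
model.fit(X_train, y_train)  # Use only training data here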
# Get the best hyperparameters
best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_
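# Train a new model on the training data using the best hyperparameters
# (ElasticNetCV already refits on the full training set with these values,
# so this step is optional; it just makes the final model explicit)
elastic_net_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))
elastic_net_model.fit(X_train, y_train)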
# Evaluate the model on the test set
y_pred = elastic_net_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2: {r2}, MSE: {mse}")
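One further refinement, offered as a sketch rather than the definitive way to do it: in the pipeline above, the scaler and encoder are fitted on the whole training set before ElasticNetCV makes its internal folds, so the validation folds still influence the preprocessing statistics. Searching over the entire pipeline with GridSearchCV (swapped in here for ElasticNetCV) refits the preprocessing inside every fold. The synthetic data, the column names `x1`, `x2`, `group`, and the grid values below are assumptions for illustration only; substitute your own X, y, and grid.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical; replace with your own X and y)
rng = np.random.default_rng(818)
n = 400
X = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),
})
y = (3.0 * X["x1"] - 2.0 * X["x2"]
     + 1.5 * (X["group"] == "a")
     + rng.normal(scale=0.5, size=n))

# Hold out the test set before any fitting or tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=818
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["x1", "x2"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["group"]),
    ]
)

# Preprocessing and model live in one pipeline, so each CV fold
# refits the scaler and encoder on that fold's training part only
pipeline = Pipeline([
    ("prep", preprocessor),
    ("enet", ElasticNet(max_iter=10_000)),
])

# Illustrative grid (an assumption, not a recommendation for your data)
param_grid = {
    "enet__alpha": np.logspace(-3, 0, 4),
    "enet__l1_ratio": [0.1, 0.5, 0.9],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)  # tuning sees training data only

# By default GridSearchCV refits the best pipeline on all of X_train,
# so it can be scored directly on the untouched test set
best_params = search.best_params_
test_r2 = r2_score(y_test, search.predict(X_test))
print(best_params, test_r2)
```

A small, log-spaced alpha grid like this also means fewer fits than an 11 by 11 grid, which may help with the runtime you mentioned, though how much depends on your data.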