Should I use the training set or the validation set for parameter optimization?


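I am training a model using a decision tree with parameter optimization.

I have read that the purpose of the validation set is to evaluate model performance during training and to help tune the parameters.

With that in mind, shouldn't I pass the validation set to grid_search.fit instead of my training set?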

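from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {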
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

#Validation
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")

#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
python machine-learning prediction training-data gridsearchcv
1 Answer

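According to the scikit-learn documentation for GridSearchCV(), the data you pass to the function is automatically split into folds and cross-validation is performed on them. You should therefore simply provide the full dataset (minus the final test data) and not worry about splitting it yourself.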

To do this, you may want to combine the training and validation datasets:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')  # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt)  # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")

# Validation
# best_estimator_ is refit on all of X_opt, which includes X_val, so scoring it
# on X_val would be optimistic. Report the mean cross-validated accuracy instead.
best_clf = grid_search.best_estimator_
cv_accuracy = grid_search.best_score_
print("Cross-Validated Accuracy with Best Model:", cv_accuracy)
print("\n")

# Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")

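With cv=5, your script optimizes the model by training each candidate on roughly 80% of the data passed to fit, with the remaining 20% or so held out as validation data. It then rotates the held-out portion across the folds. By using the modified code above, you make full use of both your training and validation data while still avoiding optimizing against the test data.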
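
To make the 80/20 rotation concrete, here is a minimal sketch of the splitting step on its own. It assumes a hypothetical toy dataset of 100 rows with two balanced classes (the names X_opt and y_opt are reused only for illustration), and it relies on the documented behaviour that GridSearchCV uses stratified k-fold splitting when cv is an integer and the estimator is a classifier.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy data: 100 rows, two balanced classes.
X_opt = np.arange(200).reshape(100, 2)
y_opt = np.array([0, 1] * 50)

# Same kind of splitter GridSearchCV applies for cv=5 with a classifier.
splitter = StratifiedKFold(n_splits=5)

fold_sizes = []
held_out_parts = []
for fold, (train_idx, val_idx) in enumerate(splitter.split(X_opt, y_opt)):
    fold_sizes.append((len(train_idx), len(val_idx)))
    held_out_parts.append(val_idx)
    print(f"fold {fold}: train on {len(train_idx)} rows, validate on {len(val_idx)} rows")

# Across the five folds, every row is held out for validation exactly once.
held_out = np.concatenate(held_out_parts)
```

Each candidate in the parameter grid is scored on all five held-out parts, and the mean of those scores is what the search compares.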

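Your understanding is correct: optimization is performed against validation data, but training must always stay on the training dataset. The GridSearchCV() function essentially does exactly this, only with k-fold cross-validation.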
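
If you would rather keep your own fixed validation set than let the search rotate folds, one possible approach is scikit-learn's PredefinedSplit. The sketch below is an illustration, not the method from the question: the data is synthetic (make_classification stands in for your X_train and X_val) and the parameter grid is shortened. Each candidate is trained only on the training rows and scored only on the validation rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the asker's X_train / X_val.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

# -1 marks rows that are only ever trained on; 0 marks the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1), np.zeros(len(X_val), dtype=int)])
split = PredefinedSplit(test_fold)

param_grid = {'max_depth': [3, 5, 7, 10]}  # shortened grid for the sketch
grid_search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=split, scoring='accuracy'
)
grid_search.fit(X_opt, y_opt)

print("Best Parameters:", grid_search.best_params_)
print("Validation accuracy of best candidate:", grid_search.best_score_)
```

Note that with the default refit=True, best_estimator_ is retrained on all of X_opt after the search, so the test set remains the only data the final model has never seen. With a single validation fold the score is noisier than a 5-fold average, which is one reason the cross-validated version above is usually preferred.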
