I'm training a model using a decision tree with hyperparameter tuning.
I've read that the purpose of the validation set is to evaluate model performance during training and to help tune the parameters.
With that in mind, shouldn't I use the validation set
in grid_search.fit
instead of my training set?
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
#Validation
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")
#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
According to the scikit-learn documentation for GridSearchCV(), the data you pass to its fit method is automatically split into folds, and cross-validation is performed on them. So you should simply provide the full dataset (minus the final test data) and not worry about splitting it yourself.

To do so, you may want to combine your training and validation datasets:
import numpy as np
# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy') # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt) # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
# Validation
# Note: X_val was included in the data fitted by the grid search above,
# so this score is optimistic; grid_search.best_score_ (the cross-validated
# score) is a better estimate of validation performance.
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")
#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
Your script optimizes your model by training on roughly 80% of the fitted data, with the remaining ~20% assigned as validation data; this assignment is then rotated across the different folds. By using the modified code above, you make full use of the training and validation data you have while still avoiding optimizing against the test data. Your understanding is correct that the optimization is evaluated against validation data, but the training itself must always remain on the training portion. The GridSearchCV() function does essentially this, just with cross-fold validation.
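As an aside, if you specifically want GridSearchCV() to evaluate on your existing X_val split (a single fixed fold) rather than on randomly rotated folds, scikit-learn's PredefinedSplit supports that. A minimal sketch, using make_classification data as a stand-in for your own X_train/X_val:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for your X_train / X_val splits.
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Merge, just like in the answer above.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

# -1 = sample is always in training; 0 = sample is in the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold)

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           {'max_depth': [3, 5]},  # abbreviated grid for brevity
                           cv=ps, scoring='accuracy')
grid_search.fit(X_opt, y_opt)
print("Best Parameters:", grid_search.best_params_)
```

This trades the robustness of averaging over several folds for an evaluation that matches your original fixed split, so plain cv=5 is usually the better default.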