I'm training a model using a decision tree with hyperparameter tuning.
I've read that the purpose of the validation set is to evaluate model performance during training and to help tune the parameters.
With that in mind, shouldn't I use the validation set
in grid_search.fit
instead of my training set?
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
#Validation
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")
#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
According to the scikit-learn documentation for GridSearchCV(), the data you pass to its fit method is automatically split into folds, and cross-validation is performed on them. So you should simply provide the full dataset (minus the final test data) and not worry about splitting it yourself.

To do so, you may want to combine your training and validation datasets:
import numpy as np
# Merge the training and validation datasets, for use in the GridSearchCV() function.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy') # This uses 5-fold cross-validation.
grid_search.fit(X_opt, y_opt) # Fit to the merged datasets.
best_params = grid_search.best_params_
print("Best Parameters:", best_params)
print("\n")
# Validation
# Note: X_val was included in the data fitted by the grid search above,
# so this score is optimistic; grid_search.best_score_ (the cross-validated
# score) is a better estimate of validation performance.
best_clf = grid_search.best_estimator_
val_accuracy = best_clf.score(X_val, y_val)
print("Validation Accuracy with Best Model:", val_accuracy)
print("\n")
#Test
y_test_pred = best_clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_precision = precision_score(y_test, y_test_pred)
test_recall = recall_score(y_test, y_test_pred)
test_f1 = f1_score(y_test, y_test_pred)
print("Decision Tree Measurements on Test Set with Best Model:")
print("Accuracy:", test_accuracy)
print("Precision:", test_precision)
print("Recall:", test_recall)
print("F1 Score:", test_f1)
print("-------------------------------------------------------")
Your script optimizes your model by training on roughly 80% of the fitted data, with the remaining ~20% assigned as validation data; this assignment is then rotated across the different folds. By using the modified code above, you make full use of the training and validation data you have while still avoiding optimizing against the test data. Your understanding is correct that the optimization is evaluated against validation data, but the training itself must always remain on the training portion. The GridSearchCV() function does essentially this, just with cross-fold validation.
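As an aside, if you specifically want GridSearchCV() to evaluate on your existing X_val split (a single fixed fold) rather than on randomly rotated folds, scikit-learn's PredefinedSplit supports that. A minimal sketch, using make_classification data as a stand-in for your own X_train/X_val:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for your X_train / X_val splits.
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_val = X[:150], X[150:]
y_train, y_val = y[:150], y[150:]

# Merge, just like in the answer above.
X_opt = np.vstack((X_train, X_val))
y_opt = np.hstack((y_train, y_val))

# -1 = sample is always in training; 0 = sample is in the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val), dtype=int)])
ps = PredefinedSplit(test_fold)

grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                           {'max_depth': [3, 5]},  # abbreviated grid for brevity
                           cv=ps, scoring='accuracy')
grid_search.fit(X_opt, y_opt)
print("Best Parameters:", grid_search.best_params_)
```

This trades the robustness of averaging over several folds for an evaluation that matches your original fixed split, so plain cv=5 is usually the better default.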