I implemented three ML algorithms (K-Nearest Neighbors, Decision Tree, and Random Forest) and evaluated each of them with four different cross-validation techniques (the Hold-Out method, Leave-One-Out, K-Fold cross-validation, and Stratified K-Fold cross-validation). The goal is to compute performance metrics and compare the techniques and algorithms against each other. My code runs, but the evaluation metrics come out identical across the different techniques. Is it normal for these values to be the same, or am I doing something wrong?
Here is part of my code:
# Initialize classifiers
knn = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
dtree = DecisionTreeClassifier(random_state=42)
rf = RandomForestClassifier(n_estimators=20, criterion='entropy', random_state=0)
classifiers = {'KNN': knn, 'Decision Tree': dtree, 'Random Forest': rf}

# Define cross-validation methods
loo = LeaveOneOut()
kf = KFold(10)
skf = StratifiedKFold(n_splits=5)
cv_methods = {'Hold-Out Method': (X_train, X_test, y_train, y_test),
              'Leave-One-Out Method': loo,
              'K-Fold Cross-Validation': kf,
              'Stratified K-Fold Cross-Validation': skf}

# Perform classification and evaluation for each classifier and cross-validation method
for clf_name, clf in classifiers.items():
    print(f"Classifier: {clf_name}")
    for cv_name, cv_method in cv_methods.items():
        if cv_name == 'Hold-Out Method':
            X_train_cv, X_test_cv, y_train_cv, y_test_cv = cv_method
            clf.fit(X_train_cv, y_train_cv)
            y_pred = clf.predict(X_test_cv)
        else:
            scores = cross_val_score(clf, X, y, cv=cv_method, scoring='accuracy')
        # Calculate evaluation metrics
        accuracy = accuracy_score(y_test_cv, y_pred)
        precision = precision_score(y_test_cv, y_pred, average='weighted')
        recall = recall_score(y_test_cv, y_pred, average='weighted')
        f1 = f1_score(y_test_cv, y_pred, average='weighted')
        confusion = confusion_matrix(y_test_cv, y_pred)
Here is the output — for each classifier it is identical across all four cross-validation methods:
Classifier: KNN
Hold-Out Method Metrics for KNN:
Accuracy: 0.864620939
Precision: 0.8661
Recall: 0.8646
F1 Score: 0.8652
Confusion Matrix:
[[326 41]
[ 34 153]]
Leave-One-Out Method Metrics for KNN:
Accuracy: 0.864620939
Precision: 0.8661
Recall: 0.8646
F1 Score: 0.8652
Confusion Matrix:
[[326 41]
[ 34 153]]
K-Fold Cross-Validation Metrics for KNN:
Accuracy: 0.864620939
Precision: 0.8661
Recall: 0.8646
F1 Score: 0.8652
Confusion Matrix:
[[326 41]
[ 34 153]]
Stratified K-Fold Cross-Validation Metrics for KNN:
Accuracy: 0.864620939
Precision: 0.8661
Recall: 0.8646
F1 Score: 0.8652
Confusion Matrix:
[[326 41]
[ 34 153]]
Classifier: Decision Tree
Hold-Out Method Metrics for Decision Tree:
Accuracy: 0.980144404
Precision: 0.9801
Recall: 0.9801
F1 Score: 0.9801
Confusion Matrix:
[[363 4]
[ 7 180]]
Leave-One-Out Method Metrics for Decision Tree:
Accuracy: 0.980144404
Precision: 0.9801
Recall: 0.9801
F1 Score: 0.9801
Confusion Matrix:
[[363 4]
[ 7 180]]
K-Fold Cross-Validation Metrics for Decision Tree:
Accuracy: 0.980144404
Precision: 0.9801
Recall: 0.9801
F1 Score: 0.9801
Confusion Matrix:
[[363 4]
[ 7 180]]
Stratified K-Fold Cross-Validation Metrics for Decision Tree:
Accuracy: 0.980144404
Precision: 0.9801
Recall: 0.9801
F1 Score: 0.9801
Confusion Matrix:
[[363 4]
[ 7 180]]
Classifier: Random Forest
Hold-Out Method Metrics for Random Forest:
Accuracy: 0.981949458
Precision: 0.9820
Recall: 0.9819
F1 Score: 0.9819
Confusion Matrix:
[[364 3]
[ 7 180]]
Leave-One-Out Method Metrics for Random Forest:
Accuracy: 0.981949458
Precision: 0.9820
Recall: 0.9819
F1 Score: 0.9819
Confusion Matrix:
[[364 3]
[ 7 180]]
K-Fold Cross-Validation Metrics for Random Forest:
Accuracy: 0.981949458
Precision: 0.9820
Recall: 0.9819
F1 Score: 0.9819
Confusion Matrix:
[[364 3]
[ 7 180]]
Stratified K-Fold Cross-Validation Metrics for Random Forest:
Accuracy: 0.981949458
Precision: 0.9820
Recall: 0.9819
F1 Score: 0.9819
Confusion Matrix:
[[364 3]
[ 7 180]]
Why does this happen?
The line
scores = cross_val_score(clf, X, y, cv=cv_method, scoring='accuracy')
does not modify the clf object — it does not leave it fitted. (I have made this mistake a few times myself; it is a bit misleading, because you do see the models being fitted in the console output.)
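This is easy to verify directly: a minimal sketch (with synthetic data standing in for the question's dataset) shows that after `cross_val_score` runs, the original estimator object is still unfitted, because the function fits internal clones of it:

```python
# Demo: cross_val_score fits clones of the estimator, so the
# original clf object stays unfitted afterwards.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

X, y = make_classification(n_samples=100, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores.shape)  # one accuracy per fold: (5,)

try:
    check_is_fitted(clf)
    print("clf is fitted")
except NotFittedError:
    print("clf is still NOT fitted")  # this branch runs
```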
What actually happens is that the model is fitted only in this section:
if cv_name == 'Hold-Out Method':
    X_train_cv, X_test_cv, y_train_cv, y_test_cv = cv_method
    clf.fit(X_train_cv, y_train_cv)
    y_pred = clf.predict(X_test_cv)
and you then compute the metrics four times from that same y_test_cv and y_pred, so every "method" reports the hold-out result.
To test this, remove 'Hold-Out Method' from cv_methods, and you will most likely get an error saying the model has not been fitted yet.
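One way to fix the loop is to let each CV method produce its own predictions, e.g. with `cross_val_predict`, and compute the metrics from those. Here is a self-contained sketch (synthetic data; the question's own X, y and classifier dict would slot in the same way — Leave-One-Out is omitted only to keep the demo fast):

```python
# Corrected loop sketch: cross_val_predict refits a clone of the estimator
# in every fold and returns one out-of-fold prediction per sample, so each
# CV method produces its own metrics instead of reusing the hold-out ones.
from sklearn.datasets import make_classification
from sklearn.model_selection import (train_test_split, KFold,
                                     StratifiedKFold, cross_val_predict)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

cv_methods = {'Hold-Out Method': (X_train, X_test, y_train, y_test),
              'K-Fold Cross-Validation': KFold(10),
              'Stratified K-Fold Cross-Validation': StratifiedKFold(n_splits=5)}

clf = KNeighborsClassifier(n_neighbors=5)
results = {}
for cv_name, cv_method in cv_methods.items():
    if cv_name == 'Hold-Out Method':
        X_tr, X_te, y_tr, y_te = cv_method
        clf.fit(X_tr, y_tr)
        y_true, y_pred = y_te, clf.predict(X_te)
    else:
        # clf is cloned and refit inside each fold here
        y_true, y_pred = y, cross_val_predict(clf, X, y, cv=cv_method)
    results[cv_name] = accuracy_score(y_true, y_pred)
    print(f"{cv_name}: accuracy={results[cv_name]:.4f}, "
          f"F1={f1_score(y_true, y_pred, average='weighted'):.4f}")
```

With this structure the metrics genuinely differ per CV method, which is what the comparison in the question was after.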