I just built my first random forest classifier today, and I'm trying to improve its performance. I was reading about how cross-validation helps avoid overfitting the data and thereby yields better results. I implemented StratifiedKFold using sklearn; surprisingly, however, this approach turned out to be less accurate. I've read many posts suggesting that cross-validating is more effective than train_test_split.
Estimator:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
K-Fold:
ss = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for train_index, test_index in ss.split(features, labels):
    train_features, test_features = features[train_index], features[test_index]
    train_labels, test_labels = labels[train_index], labels[test_index]
    rf.fit(train_features, train_labels)  # fit and evaluate once per fold
    predictions = rf.predict(test_features)
TTS:
train_feature, test_feature, train_label, test_label = \
    train_test_split(features, labels, train_size=0.8, test_size=0.2, random_state=42)
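For completeness, here is a condensed, runnable version of both setups above. Since my actual data isn't shown here, make_classification stands in as a synthetic placeholder, and the per-fold AUROC is averaged explicitly (the loop above leaves the aggregation implicit):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real features/labels (not shown in the post).
features, labels = make_classification(n_samples=500, n_features=10, random_state=42)

# Cross-validation: fit a fresh model on each fold, score on the held-out fold,
# then average the per-fold scores.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_aucs = []
for train_index, test_index in skf.split(features, labels):
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(features[train_index], labels[train_index])
    proba = rf.predict_proba(features[test_index])[:, 1]
    fold_aucs.append(roc_auc_score(labels[test_index], proba))
cv_auc = np.mean(fold_aucs)

# Single 80/20 hold-out split for comparison.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_tr, y_tr)
tts_auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

print(cv_auc, tts_auc)
```

Note that the two numbers are not expected to match exactly: the hold-out score comes from one particular split, while the CV score is averaged over ten.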
Here are the results:
CV:
AUROC: 0.74
Accuracy Score: 74.74%
Specificity: 0.69
Precision: 0.75
Sensitivity: 0.79
Matthews correlation coefficient (MCC): 0.49
F1 Score: 0.77
TTS:
AUROC: 0.76
Accuracy Score: 76.23%
Specificity: 0.77
Precision: 0.79
Sensitivity: 0.76
Matthews correlation coefficient (MCC): 0.52
F1 Score: 0.77
Is this possible, or did I set up my model incorrectly?
Also, is this the right way to use cross-validation?