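I get the highest score with the lowest value of n_estimators. As I understand it, more trees should always improve performance. Can anyone explain what is going on here?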
Input:
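# Estimate n_estimators.
# Note: `iid` and `grid_scores_` exist only in older scikit-learn releases;
# newer releases drop `iid` and expose results through `cv_results_`.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_test1 = {'n_estimators': range(20, 800, 30)}
clf = RandomForestClassifier(random_state=10,
                             oob_score=True,
                             max_depth=6,
                             max_features='sqrt')
gsearch1 = GridSearchCV(
    estimator=clf,
    param_grid=param_test1,
    scoring='roc_auc',
    iid=False,
    cv=5)
gsearch1.fit(X, y)  # X, y: the asker's data (not shown)
gsearch1.grid_scores_, gsearch1.best_params_, gsearch1.best_score_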
Output:
([mean: 0.87685, std: 0.03149, params: {u'n_estimators': 20},
mean: 0.87551, std: 0.02979, params: {u'n_estimators': 50},
mean: 0.87588, std: 0.02970, params: {u'n_estimators': 80},
mean: 0.87545, std: 0.03043, params: {u'n_estimators': 110},
mean: 0.87593, std: 0.02979, params: {u'n_estimators': 140},
mean: 0.87506, std: 0.02913, params: {u'n_estimators': 170},
mean: 0.87599, std: 0.02890, params: {u'n_estimators': 200},
mean: 0.87559, std: 0.02875, params: {u'n_estimators': 230},
mean: 0.87561, std: 0.02890, params: {u'n_estimators': 260},
mean: 0.87500, std: 0.02867, params: {u'n_estimators': 290},
mean: 0.87476, std: 0.02848, params: {u'n_estimators': 320},
mean: 0.87434, std: 0.02800, params: {u'n_estimators': 350},
mean: 0.87408, std: 0.02823, params: {u'n_estimators': 380},
mean: 0.87461, std: 0.02789, params: {u'n_estimators': 410},
mean: 0.87452, std: 0.02764, params: {u'n_estimators': 440},
mean: 0.87466, std: 0.02775, params: {u'n_estimators': 470},
mean: 0.87498, std: 0.02805, params: {u'n_estimators': 500},
mean: 0.87530, std: 0.02797, params: {u'n_estimators': 530},
mean: 0.87519, std: 0.02760, params: {u'n_estimators': 560},
mean: 0.87498, std: 0.02789, params: {u'n_estimators': 590},
mean: 0.87529, std: 0.02784, params: {u'n_estimators': 620},
mean: 0.87526, std: 0.02792, params: {u'n_estimators': 650},
mean: 0.87553, std: 0.02807, params: {u'n_estimators': 680},
mean: 0.87540, std: 0.02794, params: {u'n_estimators': 710},
mean: 0.87561, std: 0.02786, params: {u'n_estimators': 740},
mean: 0.87554, std: 0.02814, params: {u'n_estimators': 770}],
{u'n_estimators': 20},
0.87684895838888188)
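Getting the highest score with the smallest number of trees (n_estimators) in a random forest can happen for several reasons: overfitting, dataset characteristics that favor simpler models, interactions between hyperparameters, randomness in training, variability across cross-validation folds, and the limits of grid search in exploring the hyperparameter space effectively. In the output above, the means only range from 0.87408 to 0.87685 while the std across folds is about 0.03, so the differences between settings are well within the fold-to-fold variability.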
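To get a feel for how much of this is cross-validation variability, here is a rough check using only numbers copied from the output in the question. It is a sketch, not a proper significance test: the variable names are mine, and the standard-error estimate assumes the 5 folds are independent, which is only an approximation.

```python
import math

# (n_estimators, mean ROC AUC, std across folds), copied from the posted output.
reported = [
    (20, 0.87685, 0.03149),   # highest mean in the grid
    (380, 0.87408, 0.02823),  # lowest mean in the grid
    (770, 0.87554, 0.02814),  # largest forest tried
]
n_folds = 5  # cv=5 in the question

best = max(reported, key=lambda r: r[1])
worst = min(reported, key=lambda r: r[1])

# Difference between the best and worst mean scores.
gap = best[1] - worst[1]

# Rough standard error of the best mean (assumption: folds are independent).
std_err = best[2] / math.sqrt(n_folds)

print("gap between best and worst mean: %.5f" % gap)
print("std across folds (best setting): %.5f" % best[2])
print("approx. standard error of mean:  %.5f" % std_err)
print("gap / standard error:            %.2f" % (gap / std_err))
```

If the gap comes out as a small fraction of the standard error, as it does here, picking n_estimators=20 over any other value in the grid is not something the scores can really justify.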