I'm currently studying the Optuna library, and I found that there is a parameter for pruning unpromising trials. It seems this feature can only be used with incremental learning methods (e.g. the SGD classifier) or neural networks. So I'm wondering: is it possible to use trial pruning with random forests, decision trees (CART), or even logistic regression?
Thank you very much! :)
PS: I haven't found a single example on the internet of a random forest with trial pruning in Optuna...
SGDClassifier
with loss='log_loss'
performs logistic regression, so it lets you do logistic regression with incremental learning. (In scikit-learn versions before 1.1 the same loss was named 'log'.)
As for random forests and decision trees: they are batch learners, so trial pruning doesn't apply to them directly. However, you can wrap a batch learner in a class (the
PseudoIncrementalBatchLearner
below) that refits the learner on an increasing portion of the data each time partial_fit()
is called. This is similar to how learning curves are generated, where an estimator is refit on increasing fractions of the dataset.
In general, as a learner is fit on a larger portion of the data, its generalisation error and bias go down, and this trend is fairly predictable for an individual estimator. When comparing learners, however, you may want to prune those that are improving relatively slowly and would be too expensive to train on the full dataset... which is where the
PseudoIncrementalBatchLearner
below could be useful.
The data and example below show how the random forest in blue improves more slowly than the one in orange, making the blue one a candidate for early pruning. This saves you from having to train every learner on the full dataset (though eventually they become comparable).
from sklearn import model_selection
from sklearn import base
from sklearn.utils import check_X_y
import numpy as np

#
# Wraps a batch learner, refitting it on a larger portion of the data
# each time partial_fit() is called
#
class PseudoIncrementalBatchLearner(
    base.BaseEstimator,
    base.MetaEstimatorMixin,
    base.ClassifierMixin,
    base.RegressorMixin
):
    def __init__(self, estimator, max_steps=20, random_state=None):
        self.estimator = estimator
        self.max_steps = max_steps
        self.random_state = random_state

    def partial_fit(self, X, y):
        # Record feature names before check_X_y() converts X to an ndarray
        if hasattr(X, 'columns'):
            self.feature_names_in_ = np.array(X.columns, dtype='object')
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        if not hasattr(self, 'current_step_'):
            self.current_step_ = 0

        # Beyond max_steps, no more refitting, as already fit on all data.
        # Could optionally comment this part out.
        if self.current_step_ + 1 > self.max_steps:
            return self

        # Get ShuffleSplit/StratifiedShuffleSplit for regressor/classifier
        cv = getattr(
            model_selection,
            ('Stratified' if base.is_classifier(self.estimator) else '') + 'ShuffleSplit'
        )

        # Shuffle and split off the required size for this step
        if self.current_step_ + 1 < self.max_steps:
            train_ix, _ = next(cv(
                n_splits=1,
                train_size=(self.current_step_ + 1) / self.max_steps,
                random_state=self.random_state,
            ).split(X, y))
        else:
            train_ix = np.arange(len(X))

        # Refit estimator on the current portion of the dataset
        self.estimator_ = base.clone(self.estimator).fit(X[train_ix], y[train_ix])
        self.current_step_ += 1
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
#
#Make test dataset
#
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_val, y_val = make_moons(n_samples=200, noise=0.2, random_state=1)
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.gcf().set_size_inches(6, 3)
plt.show()
#Create two classifiers to see which learns in fewer steps.
from sklearn.ensemble import RandomForestClassifier
rf0 = RandomForestClassifier(n_estimators=10, random_state=np.random.RandomState(0))
rf1 = RandomForestClassifier(n_estimators=100, random_state=np.random.RandomState(1))
pi_rf0 = PseudoIncrementalBatchLearner(rf0, random_state=np.random.RandomState(0))
pi_rf1 = PseudoIncrementalBatchLearner(rf1, random_state=np.random.RandomState(1))
#Run pseudo-incremental training (training on larger portions of same data, each step)
val_scores0, val_scores1 = [], []
for i in range(pi_rf0.max_steps):
    pi_rf0.partial_fit(X, y)
    pi_rf1.partial_fit(X, y)
    val_scores0.append(pi_rf0.score(X_val, y_val))
    val_scores1.append(pi_rf1.score(X_val, y_val))
#Plot results
plt.plot(val_scores0, lw=2, label='rf0 validation')
plt.plot(val_scores1, lw=2, label='rf1 validation')
plt.xlabel('training "step" (i.e. proportion of the training data)')
plt.gca().set_xticks(range(pi_rf0.max_steps))
plt.ylabel('accuracy')
plt.gcf().set_size_inches(8, 2.5)
plt.gcf().legend()