Optuna pruning trials for a random forest classifier


I am currently exploring the Optuna library, and I found there is a feature for pruning unpromising trials. It seems this feature can only be used with incremental learning methods such as the SGD classifier, or with neural networks. So I would like to know: is it possible to use trial pruning with random forests, decision trees (CART), or even logistic regression?

Many thanks! :)

PS: I haven't found any example on the internet that uses a random forest together with Optuna's trial pruning...

logistic-regression random-forest cart hyperparameters optuna
1 Answer

`SGDClassifier` with `loss='log_loss'` performs logistic regression, so you can use incremental learning for logistic regression.

Random forests and decision trees are batch learners, so trial pruning does not apply to them directly. However, you can wrap a batch learner in a class (`PseudoIncrementalBatchLearner` below) that refits the learner on an increasingly large portion of the data each time `partial_fit()` is called. This is similar to how learning curves are generated, where the estimator is refit on growing fractions of the dataset.

In general, as a learner is fit on larger portions of the data, its generalisation error and bias will decrease, and for an individual estimator this trend is somewhat predictable. When comparing learners, however, you may want to prune those that are improving relatively slowly and would be too expensive to train on the full dataset... which is where the `PseudoIncrementalBatchLearner` below can be useful.

The data and example below show how the blue random forest improves only slowly compared to the orange one, making the blue forest a candidate for early pruning. This saves you from having to train every learner on the full dataset (although in the end they are comparable).

from sklearn import model_selection
from sklearn import base
from sklearn.utils import check_X_y
import numpy as np

#
#Wraps a batch learner, training it on larger portions of the data
# each time partial_fit() is called
#
class PseudoIncrementalBatchLearner(
    base.BaseEstimator,
    base.MetaEstimatorMixin,
    base.ClassifierMixin,
    base.RegressorMixin
):
    def __init__(self, estimator, max_steps=20, random_state=None):
        self.estimator = estimator
        self.max_steps = max_steps
        self.random_state = random_state
    
    def partial_fit(self, X, y):
        #Record feature names before check_X_y converts X to an ndarray
        if hasattr(X, 'columns'):
            self.feature_names_in_ = np.array(
                X.columns, dtype='object'
            )
        X, y = check_X_y(X, y)
        self.n_features_in_ = X.shape[1]
        
        if not hasattr(self, 'current_step_'):
            self.current_step_ = 0
        
        #Get ShuffleSplit/StratifiedShuffleSplit for regressor/classifier
        cv = getattr(
            model_selection,
            ('Stratified' if base.is_classifier(self.estimator) else '') + 'ShuffleSplit'
        )
        
        #Shuffle and split off the required size for this step
        if self.current_step_ + 1 < self.max_steps:
            train_ix, _ = next(cv(
                n_splits=1,
                train_size=(self.current_step_ + 1) / self.max_steps,
                random_state=self.random_state
            ).split(X, y))
        else:
            train_ix = np.arange(len(X))

        #Beyond max_steps, no more refitting, as already fit on all data.
        # Could optionally comment this part out.
        if self.current_step_ + 1 > self.max_steps:
            return self
        
        #Refit estimator on the current portion of the dataset
        self.estimator_ = base.clone(self.estimator).fit(X[train_ix], y[train_ix])
        self.current_step_ += 1
        return self
    
    def predict(self, X):
        return self.estimator_.predict(X)


#
#Make test dataset
#
from matplotlib import pyplot as plt
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_val, y_val = make_moons(n_samples=200, noise=0.2, random_state=1)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
plt.gcf().set_size_inches(6, 3)
plt.show()

#Create two classifiers to see which learns in fewer steps.
from sklearn.ensemble import RandomForestClassifier
rf0 = RandomForestClassifier(n_estimators=10, random_state=np.random.RandomState(0))
rf1 = RandomForestClassifier(n_estimators=100, random_state=np.random.RandomState(1))

pi_rf0 = PseudoIncrementalBatchLearner(rf0, random_state=np.random.RandomState(0))
pi_rf1 = PseudoIncrementalBatchLearner(rf1, random_state=np.random.RandomState(1))

#Run pseudo-incremental training (training on larger portions of same data, each step)
val_scores0, val_scores1 = [], []
for i in range(pi_rf0.max_steps):
    pi_rf0.partial_fit(X, y)
    pi_rf1.partial_fit(X, y)
    
    val_scores0.append(pi_rf0.score(X_val, y_val))
    val_scores1.append(pi_rf1.score(X_val, y_val))

#Plot results
plt.plot(val_scores0, lw=2, label='rf0 validation')
plt.plot(val_scores1, lw=2, label='rf1 validation')
plt.xlabel('training "step" (i.e. proportion of the training data)')
plt.gca().set_xticks(range(pi_rf0.max_steps))
plt.ylabel('accuracy')
plt.gcf().set_size_inches(8, 2.5)
plt.gcf().legend()