I have to use the code below for gradient boosting classification on a binary classification problem.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
# Creating the training and test datasets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)
# Count of goods in the training set
# This count is 50000
y0 = len(y_train[y_train['bad_flag'] == 0])
# Count of bads in the training set
# This count is 100
y1 = len(y_train[y_train['bad_flag'] == 1])
# Creating the sample_weights array. Include all bad customers and
# twice the number of goods as bads
w0 = (y1 / y0) * 2
w1 = 1
sample_weights = np.zeros(len(y_train))
sample_weights[y_train['bad_flag'] == 0] = w0
sample_weights[y_train['bad_flag'] == 1] = w1
model = GradientBoostingClassifier(
    n_estimators=100, max_features=0.5, random_state=1)
model.fit(X_train, y_train.values.ravel(), sample_weight=sample_weights)
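To sanity-check the weighting scheme, here is a small standalone sketch (using made-up arrays with the counts stated above, not the actual training data) showing that with w0 = (y1/y0)*2 the total weight on goods is exactly twice the total weight on bads:

```python
import numpy as np

# Hypothetical labels matching the post: 50000 goods (0), 100 bads (1)
y_train = np.array([0] * 50000 + [1] * 100)

y0 = np.sum(y_train == 0)  # number of goods
y1 = np.sum(y_train == 1)  # number of bads

w0 = (y1 / y0) * 2         # weight per good
w1 = 1                     # weight per bad

sample_weights = np.where(y_train == 0, w0, w1)

# Weighted mass of goods (~200) is twice the weighted mass of bads (100)
print(sample_weights[y_train == 0].sum())
print(sample_weights[y_train == 1].sum())
```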
My reasoning behind writing this code is as follows:
GradientBoostingClassifier has subsample=1.0 by default, which means the sample size used at each stage (for each of the n_estimators) will be the same as the original dataset. The weights do not change anything about the size of the subsample. If you want to force 300 observations per stage, you need to set subsample = 300/(50000+100) in addition to redefining the weights. Note that the subsample is drawn randomly. You can read more about it here: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting. It says:
At each iteration the base classifier is trained on a fraction subsample of the available training data.
So, as a result, a certain amount of bootstrap-style sampling is combined with the boosting algorithm.
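Putting the two knobs together, here is a minimal runnable sketch (synthetic imbalanced data standing in for the post's 50000 goods / 100 bads; the parameter values are illustrative) that combines the sample weights with a subsample setting that draws roughly 300 rows per boosting stage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic imbalanced data: ~98% class 0 (goods), ~2% class 1 (bads)
X, y = make_classification(n_samples=5010, weights=[0.98], random_state=1)

# Same weighting scheme as in the question
w0 = (np.sum(y == 1) / np.sum(y == 0)) * 2
sample_weights = np.where(y == 0, w0, 1.0)

# subsample < 1.0 turns on stochastic gradient boosting: each stage is
# fit on a random fraction of the training rows (~300 here)
model = GradientBoostingClassifier(
    n_estimators=100,
    max_features=0.5,
    subsample=300 / len(y),
    random_state=1,
)
model.fit(X, y, sample_weight=sample_weights)
print(model.score(X, y))
```

The subsample is redrawn at every stage, so across 100 estimators the model still sees most of the data while each individual tree is fit on a small random slice.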