My goal is to create a pipeline that handles the preprocessing and performs nested cross-validation to prevent information leakage. I build one pipeline per model, then compare their performance and pick the best model.
Questions:
Any other criticism would also be greatly appreciated.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import GridSearchCV, cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

#Organising data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
pipe_rf = Pipeline([('scl', StandardScaler()),
('clf', RandomForestClassifier(random_state=42))])
param_range = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
grid_params_rf = [{'clf__criterion': ['gini', 'entropy'],
'clf__min_samples_leaf': param_range,
'clf__max_depth': param_range,
'clf__min_samples_split': param_range[1:]}]
gs_rf = GridSearchCV(estimator=pipe_rf,
param_grid=grid_params_rf,
scoring=['accuracy', 'f1', 'recall'],
refit = 'accuracy',
cv=10,
n_jobs=-1)
gs_rf.fit(X_train, y_train)
#Get out training scores:
gs_rf.best_params_
gs_rf.best_score_ #mean cross-validated accuracy of the best parameters
#train f1
#train recall
#Find out how well it generalises by predicting using x_test and comparing predictions to y_test
y_predict = gs_rf.predict(X_test)
accuracy_score(y_test, y_predict) #test accuracy
recall_score(y_test, y_predict) #test recall
f1_score(y_test, y_predict) #test f1
#Evaluating the model (using this value to compare all of my different models, e.g. RF, SVM, DT)
scor = cross_validate(gs_rf, X_test, y_test, scoring=['accuracy', 'f1', 'recall'], cv=5, n_jobs=-1)
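For reference, the nested cross-validation the post is aiming for can also be run end-to-end over the full data, with the grid search re-fitted inside every outer fold. A minimal runnable sketch (toy data from `make_classification`; all names and the small grid here are illustrative, not from the original post):

```python
# Sketch of nested CV: the inner GridSearchCV tunes hyperparameters,
# the outer cross_val_score measures generalisation, so the reported
# scores are not biased by the tuning step.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

inner = GridSearchCV(
    estimator=Pipeline([('scl', StandardScaler()),
                        ('clf', RandomForestClassifier(random_state=42))]),
    param_grid={'clf__max_depth': [2, 4]},  # tiny grid, for illustration
    scoring='accuracy',
    cv=3,
)

# Each of the 5 outer folds re-runs the inner 3-fold grid search.
outer_scores = cross_val_score(inner, X_demo, y_demo, cv=5)
print(outer_scores.mean())
```

This is the same pattern as the `cross_validate(gs_rf, ...)` line above, except run on the whole dataset rather than only the held-out test split.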
The GridSearchCV object keeps the cross-validation score for every metric you passed to `scoring`. You can extract these scores from the cv_results_ attribute. Here is how to get the training F1 and recall scores (note that the `mean_train_*` keys only exist if you pass `return_train_score=True` to GridSearchCV):

f1_train_scores = gs_rf.cv_results_['mean_train_f1']
recall_train_scores = gs_rf.cv_results_['mean_train_recall']

These are arrays with one entry per parameter combination; `gs_rf.best_index_` picks out the entry for the best parameters.

I hope this helps!
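The lookup described above can be demonstrated on toy data. A small self-contained sketch (the classifier, grid, and dataset are placeholders chosen for speed, not taken from the original post):

```python
# Multi-metric GridSearchCV: cv_results_ holds one array per metric,
# indexed by parameter combination; best_index_ selects the winner.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=120, random_state=0)

gs = GridSearchCV(LogisticRegression(max_iter=1000),
                  param_grid={'C': [0.1, 1.0]},
                  scoring=['accuracy', 'f1', 'recall'],
                  refit='accuracy',
                  cv=3,
                  return_train_score=True)  # required for mean_train_* keys
gs.fit(X_demo, y_demo)

best = gs.best_index_
print(gs.cv_results_['mean_test_f1'][best])      # CV f1 of best params
print(gs.cv_results_['mean_train_recall'][best]) # train-fold recall
```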