I wrote a Python for loop that splits the training dataset into stratified K-folds, trains a classifier inside the loop, and then uses the trained model to predict on the validation fold. The metrics obtained this way are completely different from those obtained with the cross_val_score function, although I expected the two approaches to give the same results.
The code is for text classification, and I vectorize the text with TF-IDF.
Here is the code.
Manual cross-validation:
# Importing metrics functions to measure performance of a model
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

data_validation = []  # list used to store the results of model validation using cross-validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_val = []
f1_val = []

# use ravel to flatten the multi-dimensional arrays to a single dimension
for train_index, val_index in skf.split(X_train, y_train):
    X_tr, X_val = X_train.ravel()[train_index], X_train.ravel()[val_index]
    y_tr, y_val = y_train.ravel()[train_index], y_train.ravel()[val_index]
    tfidf = TfidfVectorizer()
    X_tr_vec_tfidf = tfidf.fit_transform(X_tr)  # vectorize the training folds
    X_val_vec_tfidf = tfidf.transform(X_val)    # vectorize the validation fold
    # instantiate model
    model = MultinomialNB(alpha=0.5, fit_prior=False)
    # train the model on the training folds
    model.fit(X_tr_vec_tfidf, y_tr)
    predictions_val = model.predict(X_val_vec_tfidf)  # predict on the validation fold
    acc_val = accuracy_score(y_val, predictions_val)
    accuracy_val.append(acc_val)
    f_val = f1_score(y_val, predictions_val)
    f1_val.append(f_val)

avg_accuracy_val = np.mean(accuracy_val)
avg_f1_val = np.mean(f1_val)

# temp list to store the metrics
temp = ['NaiveBayes']
temp.append(avg_accuracy_val)  # validation accuracy score
temp.append(avg_f1_val)        # validation F1 score
data_validation.append(temp)

# Create a table (DataFrame) containing the metrics for all the trained and tested ML models
result = pd.DataFrame(data_validation, columns=['Algorithm', 'Accuracy Score : Validation', 'F1-Score : Validation'])
result.reset_index(drop=True, inplace=True)
result
Output:
Algorithm Accuracy Score : Validation F1-Score : Validation
0 NaiveBayes 0.77012 0.733994
Now the code using the cross_val_score function:
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer

# Text vectorization of training and testing datasets using TF-IDF
tfidf = TfidfVectorizer()
X_tr_vec_tfidf = tfidf.fit_transform(X_train)
X_tst_vec_tfidf = tfidf.transform(X_test)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nb = MultinomialNB(alpha=0.5, fit_prior=False)
for score in ['accuracy', 'f1']:
    print(f'{score}: {cross_val_score(nb, X_tr_vec_tfidf, y_train, cv=skf, scoring=score).mean()}')
Output:
accuracy: 0.7341283583255231
f1: 0.7062017090972422
As you can see, the accuracy and F1 metrics differ substantially between the two approaches. When I use KNeighborsClassifier, the difference in the metrics is even more pronounced.
TL;DR: the two computations are not equivalent because they handle the TF-IDF transformation differently; the first one is correct.
In the first computation, you correctly apply fit_transform only to the training folds and transform to the validation fold:

X_tr_vec_tfidf = tfidf.fit_transform(X_tr)  # vectorize the training folds
X_val_vec_tfidf = tfidf.transform(X_val)    # vectorize the validation fold
But in the second computation you do not do this; instead, you apply fit_transform to the whole training data before it is split into training and validation folds:

X_tr_vec_tfidf = tfidf.fit_transform(X_train)

This means the vectorizer's vocabulary and IDF weights are computed from documents that later end up in the validation folds, i.e. information leaks from the validation data into training. Hence the discrepancy (the fact that you happen to get better accuracy with the second, erroneous computation is irrelevant).
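To see why this matters, here is a minimal sketch on a hypothetical toy corpus: when the vectorizer is fitted on the full data, its vocabulary (and hence its IDF weights) contains terms that only occur in the validation documents, information a fold-only vectorizer could never have.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: first two documents play the role of the
# training fold, the last two the validation fold
docs = ["spam spam offer", "hello friend", "offer now", "hello again"]
fold_train, fold_val = docs[:2], docs[2:]

# Leaky: fit on ALL documents, including the validation ones
leaky = TfidfVectorizer().fit(docs)
# Correct: fit only on the training fold
clean = TfidfVectorizer().fit(fold_train)

# 'now' and 'again' occur only in the validation documents, yet the
# leaky vectorizer knows about them; the fold-only vectorizer does not
print(sorted(leaky.vocabulary_))
print(sorted(clean.vocabulary_))
```

The two vectorizers therefore produce different feature matrices, which is exactly the discrepancy observed between the two computations above.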
The correct way to use cross_val_score here is through a pipeline (see the API and the User Guide):
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer()
nb = MultinomialNB(alpha=0.5, fit_prior=False)
pipeline = Pipeline([('transformer', tfidf), ('estimator', nb)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf)
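If you also want both accuracy and F1 from the pipeline, as in your manual loop, cross_validate evaluates several metrics in one pass. A self-contained sketch, using a hypothetical toy dataset in place of your X_train/y_train (which are not shown in the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for X_train / y_train
X_train = ["good movie", "great film", "bad plot", "awful acting",
           "nice story", "terrible cast", "great plot", "bad film"] * 5
y_train = [1, 1, 0, 0, 1, 0, 1, 0] * 5

pipeline = Pipeline([
    ("transformer", TfidfVectorizer()),
    ("estimator", MultinomialNB(alpha=0.5, fit_prior=False)),
])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The pipeline re-fits the vectorizer on each training fold, so the
# validation fold is only ever transformed, never used for fitting
cv_results = cross_validate(pipeline, X_train, y_train, cv=skf,
                            scoring=["accuracy", "f1"])
print(cv_results["test_accuracy"].mean())
print(cv_results["test_f1"].mean())
```

These per-fold means are computed the same way as in your manual loop, so with the pipeline the two approaches agree.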