Getting different score values between manual cross-validation and cross_val_score

Question — votes: 0, answers: 1

I created a Python for loop that splits the training dataset into stratified K folds and trains a classifier inside the loop. The trained model is then used to make predictions on the validation fold. The metrics obtained with this procedure are completely different from the metrics obtained with the cross_val_score function. I expected to get the same results with both approaches.

This code is for text classification, and I vectorize the text using TF-IDF.

Here is the code:

Code for the manual implementation of cross-validation:

# Importing metrics functions to measure performance of a model
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
data_validation = []  # list used to store the results of model validation using cross validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_val = []
f1_val = []

# use ravel function to flatten the multi-dimensional array to a single dimension
for train_index, val_index in skf.split(X_train, y_train):
    X_tr, X_val = X_train.ravel()[train_index], X_train.ravel()[val_index] 
    y_tr, y_val  = y_train.ravel()[train_index] , y_train.ravel()[val_index]
    tfidf=TfidfVectorizer()
    X_tr_vec_tfidf = tfidf.fit_transform(X_tr) # vectorize the training folds
    X_val_vec_tfidf = tfidf.transform(X_val) # vectorize the validation fold    
    #instantiate model 
    model= MultinomialNB(alpha=0.5, fit_prior=False) 
    #Training the empty model with our training dataset
    model.fit(X_tr_vec_tfidf, y_tr)  
    predictions_val = model.predict(X_val_vec_tfidf) # make predictions with the validation dataset
    acc_val = accuracy_score(y_val, predictions_val)
    accuracy_val.append(acc_val)
    f_val=f1_score(y_val, predictions_val)
    f1_val.append(f_val)

avg_accuracy_val = np.mean(accuracy_val)
avg_f1_val = np.mean(f1_val)

# temp list to store the metrics 
temp = ['NaiveBayes']
temp.append(avg_accuracy_val)   #validation accuracy score 
temp.append(avg_f1_val)         #validation f1 score
data_validation.append(temp)    
#Create a table ,using dataframe, which contains the metrics for all the trained and tested ML models
result = pd.DataFrame(data_validation, columns = ['Algorithm','Accuracy Score : Validation','F1-Score  : Validation'])
result.reset_index(drop=True, inplace=True)
result      

Output:

    Algorithm   Accuracy Score : Validation     F1-Score : Validation
0   NaiveBayes  0.77012                      0.733994

Now the code using the cross_val_score function:

from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
scores = ['accuracy', 'f1']
#Text vectorization of training and testing datasets using NLP technique TF-IDF
tfidf=TfidfVectorizer()
X_tr_vec_tfidf = tfidf.fit_transform(X_train)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nb=MultinomialNB(alpha=0.5, fit_prior=False) 
for score in ["accuracy", "f1"]:
    print (f'{score}: {cross_val_score(nb,X_tr_vec_tfidf,y_train,cv=skf,scoring=score).mean()} ')

Output:

accuracy: 0.7341283583255231 
f1: 0.7062017090972422 

As can be seen, the accuracy and F1 metrics are very different between the two approaches. When I use KNeighborsClassifier, the difference in the metrics is even worse.

python machine-learning scikit-learn cross-validation text-classification
1 Answer

0 votes

TL;DR: The two computations are not equivalent, due to the different ways they handle the TF-IDF transformation; the first computation is the correct one.


In the first computation, you correctly apply fit_transform to the training folds only, and transform to the validation fold:

X_tr_vec_tfidf = tfidf.fit_transform(X_tr) # vectorize the training folds
X_val_vec_tfidf = tfidf.transform(X_val) # vectorize the validation fold
But in the second computation you do not do this; instead, you apply fit_transform to the whole of the training data, before it is split into training and validation folds:

X_tr_vec_tfidf = tfidf.fit_transform(X_train)

Hence the discrepancy (the fact that you seem to get better accuracy with the second, wrong computation is irrelevant).
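To see concretely what fitting on the full training data leaks, here is a minimal sketch on a toy corpus (not your data): the vectorizer fitted on all documents picks up vocabulary (and IDF statistics) from documents that belong to the validation fold, which the fold-only vectorizer never sees.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["spam offer now", "meeting at noon", "free spam offer", "lunch meeting"]
train_fold, val_fold = docs[:2], docs[2:]

# Correct: fit the vectorizer on the training fold only.
tfidf_fold = TfidfVectorizer().fit(train_fold)

# Leaky: fit on all the data, as in the second computation.
tfidf_all = TfidfVectorizer().fit(docs)

print(sorted(tfidf_fold.vocabulary_))  # words from the training fold only
print(sorted(tfidf_all.vocabulary_))   # also contains validation-fold words
```

Here "free" and "lunch" appear only in the validation fold, so they end up in the leaky vectorizer's vocabulary but not in the fold-only one; the resulting feature matrices (and therefore the scores) differ.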


The correct way to use cross_val_score when we have transformations is via a pipeline (see the Pipeline docs and the User Guide):

from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer()
nb = MultinomialNB(alpha=0.5, fit_prior=False)
pipeline = Pipeline([('transformer', tfidf), ('estimator', nb)])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf)
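To get both accuracy and F1 in one pass, as your manual loop does, you can score the pipeline with cross_validate instead. A self-contained sketch (the corpus here is synthetic filler, standing in for your X_train / y_train, just so the snippet runs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the real training data.
X_demo = ["good film", "bad film", "great movie", "awful movie",
          "good plot", "bad plot", "great acting", "awful acting",
          "nice story", "poor story"]
y_demo = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pipeline = Pipeline([('transformer', TfidfVectorizer()),
                     ('estimator', MultinomialNB(alpha=0.5, fit_prior=False))])
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each fold re-fits the vectorizer on its own training portion,
# matching the manual loop, and scores both metrics at once.
cv_results = cross_validate(pipeline, X_demo, y_demo, cv=skf,
                            scoring=['accuracy', 'f1'])
print('accuracy:', cv_results['test_accuracy'].mean())
print('f1:', cv_results['test_f1'].mean())
```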
    