[Images: CV accuracy; CV accuracy graph; test accuracy]
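I am trying to implement Naive Bayes on the Amazon Fine Food Reviews dataset. Could you look at the code and explain why there is such a large difference between the cross-validation accuracy and the test accuracy?
Conceptually, is there anything wrong with the code below?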
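# Bag of words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score, roc_auc_score, confusion_matrix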
bow = CountVectorizer(ngram_range = (2,3))
bow_vect = bow.fit(X_train["F_review"].values)
bow_sparse = bow_vect.transform(X_train["F_review"].values)
X_bow = bow_sparse
y_bow = y_train
roc = []
accuracy = []
f1 = []
k_value = []
for i in range(1, 50, 2):
    BNB = BernoulliNB(alpha=i)
    print("************* for alpha = ", i, "*************")
    x = cross_validate(BNB, X_bow, y_bow, scoring=['accuracy', 'f1', 'roc_auc'], return_train_score=False, cv=10)
    print(x["test_roc_auc"].mean())
    print("-----c------break------c-------break-------c-----------")
    roc.append(x['test_roc_auc'].mean())  # mean ROC AUC across the folds
    accuracy.append(x['test_accuracy'].mean())  # mean accuracy across the folds
    f1.append(x['test_f1'].mean())  # mean F1 score across the folds
    k_value.append(i)  # alpha value tried in this iteration
#BOW Test prediction
BNB =BernoulliNB(alpha= 1)
BNB.fit(X_bow, y_bow)
y_pred = BNB.predict(bow_vect.transform(X_test["F_review"]))
print("Accuracy Score: ",accuracy_score(y_test,y_pred))
print("ROC: ", roc_auc_score(y_test,y_pred))
print("Confusion Matrix: ", confusion_matrix(y_test,y_pred))
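Use one of the metrics to find the best alpha value. Then train BernoulliNB with that alpha and evaluate it on the test data.
Also, do not rely on accuracy as the performance measure, because it is misleading on imbalanced datasets.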
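To illustrate why accuracy can mislead on imbalanced data, here is a minimal sketch. The labels are made up for illustration (90 positive, 10 negative) and are not taken from the review dataset; the "classifier" simply predicts the majority class every time.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, roc_auc_score

# Hypothetical labels: 90 positive reviews and 10 negative ones (imbalanced).
y_true = [1] * 90 + [0] * 10

# A useless classifier that predicts "positive" for every review.
y_pred = [1] * 100

acc = accuracy_score(y_true, y_pred)               # fraction of correct predictions
bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
auc = roc_auc_score(y_true, y_pred)                # constant scores carry no ranking information

print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
print("ROC AUC:", auc)
```

Accuracy looks high here even though the model never identifies a negative review, while balanced accuracy and ROC AUC sit at chance level.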
Before doing anything else, change the values used in the loop as Kalsi suggested in the comments.
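The advice above could be sketched roughly as follows. This is only an illustration under several assumptions: the toy texts stand in for the review data, the alpha grid is an arbitrary example (not necessarily the values suggested in the comments), and `GridSearchCV` with a `Pipeline` is swapped in for the manual loop so that the vectorizer is refit inside each fold. ROC AUC is used as the selection metric, and the test score is computed from predicted probabilities rather than hard labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Toy stand-in for the review texts (hypothetical, not the Amazon dataset).
pos = ["great taste will buy again", "really good product love it",
       "tasty and fresh highly recommend", "love this great flavor"]
neg = ["bad taste will not buy again", "terrible product waste of money",
       "stale and bland do not recommend", "awful flavor very disappointed"]
texts = pos * 10 + neg * 5        # 40 positive, 20 negative (imbalanced)
labels = [1] * 40 + [0] * 20

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Vectorizer and classifier in one pipeline, so each CV fold fits its own vocabulary.
pipe = Pipeline([
    ("bow", CountVectorizer(ngram_range=(2, 3))),
    ("nb", BernoulliNB()),
])

# Example alpha grid (an assumption); select alpha by mean ROC AUC over 5 folds.
param_grid = {"nb__alpha": [0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)      # refits on all training data with the best alpha

best_alpha = search.best_params_["nb__alpha"]
test_scores = search.predict_proba(X_test)[:, 1]   # probability of class 1
test_auc = roc_auc_score(y_test, test_scores)

print("best alpha:", best_alpha)
print("test ROC AUC:", test_auc)
```

The point of the sketch is that the alpha used for the final test evaluation comes from cross-validation on the training data, instead of being fixed at 1.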