I am building a multi-label classifier with a LinearSVC pipeline, training one classifier per label. I am adapting the code from this article to my data set:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# df_train, df_validate, df_test, stop_words and categories are defined elsewhere
train = df_train  # training data-set
X_train = df_train.X  # training data-set without the labels
test = df_test  # test data-set
X_test = df_test.X  # test data-set without the labels
validation = df_validate  # validation data-set
X_validation = df_validate.X  # validation data-set without the labels

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])

# iterating through all the labels
for category in categories:
    print('... Processing {}'.format(category))
    # fit the pipeline on the training text and this label's column
    SVC_pipeline.fit(X_train, train[category])
    y_pred_train = SVC_pipeline.predict(X_train)
    # predict on the validation data
    y_pred_validation = SVC_pipeline.predict(X_validation)
    # using f1 score as metric
    # (if the label column is binary, micro-averaged F1 equals accuracy)
    train_score = f1_score(train[category], y_pred_train, average='micro')
    val_score = f1_score(validation[category], y_pred_validation, average='micro')
    # show results
    print("training F1: {}".format(train_score))
    print("validation F1: {}".format(val_score))
    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
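With the code above I can compute the score on the training, validation and test sets separately for each label (category). How can the validation set be used to improve the accuracy of the trained classifier?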
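I think this is a basic machine learning question and not really on topic for Stack Overflow. I suggest reading this wiki on the subject; it may help clear things up.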
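For what it's worth, the usual role of a validation set is model selection: you fit several candidate configurations on the training data, compare them on the validation data, and keep the best one. Below is a minimal sketch of that idea for a single label, assuming the only thing being tuned is the `C` parameter of `LinearSVC`. The function name `select_C_on_validation`, the candidate values and the toy texts are illustrative assumptions and do not come from the question. It also drops the `OneVsRestClassifier` wrapper, since each label column is treated here as a plain binary target.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def select_C_on_validation(X_train, y_train, X_val, y_val,
                           candidates=(0.01, 0.1, 1.0, 10.0)):
    """Fit one pipeline per candidate C; keep the one with the best validation F1."""
    best_C, best_score, best_model = None, -1.0, None
    for C in candidates:
        model = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", LinearSVC(C=C)),
        ])
        # Only the training data is used for fitting.
        model.fit(X_train, y_train)
        # The validation data is used only to compare candidates.
        score = f1_score(y_val, model.predict(X_val))
        if score > best_score:
            best_C, best_score, best_model = C, score, model
    return best_C, best_score, best_model


# Toy data standing in for one label column (hypothetical, for illustration only).
train_texts = [
    "cheap pills buy now", "limited offer buy cheap", "win money now",
    "meeting at noon tomorrow", "project report attached", "lunch with the team",
]
train_labels = [1, 1, 1, 0, 0, 0]
val_texts = [
    "buy cheap pills", "win cheap money",
    "team meeting tomorrow", "report for the project",
]
val_labels = [1, 1, 0, 0]

best_C, best_score, best_model = select_C_on_validation(
    train_texts, train_labels, val_texts, val_labels
)
print("best C: {}, validation F1: {}".format(best_C, best_score))
```

In the loop from the question you would call something like this once per `category`, and touch the test set only once at the end with the selected model. Otherwise the test score stops being an unbiased estimate.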