scikit-learn: How can I use a validation dataset to improve a classifier's performance?

I am building a multi-label classifier with a LinearSVC pipeline, training one classifier per label. I am adapting the code from this article to my dataset:

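from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train = df_train # training data-set
X_train = df_train.X # training text, without the labels
test = df_test # test data-set
X_test = df_test.X # test text, without the labels
validation = df_validate # validation data-set
X_validation = df_validate.X # validation text, without the labels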

SVC_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words=stop_words)),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
])
# iterating through all the labels
for category in categories:
    print('... Processing {}'.format(category))
    # fit the pipeline on the training text and this category's labels
    SVC_pipeline.fit(X_train, train[category])
    y_pred_train = SVC_pipeline.predict(X_train)
    # predict on the validation data
    y_pred_validation = SVC_pipeline.predict(X_validation)
    # using f1 score as metric
    train_score = f1_score(train[category], y_pred_train, average='micro')
    val_score = f1_score(validation[category], y_pred_validation, average='micro')

    # show results
    print("training F1: {}".format(train_score))
    print("validation F1: {}".format(val_score))

    # compute the testing accuracy
    prediction = SVC_pipeline.predict(X_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))

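With the code above I can compute the training, validation, and test scores separately for each label (category). How can the validation dataset be used to improve the accuracy of the trained classifier?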

python validation scikit-learn svm multilabel-classification
1 Answer

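I think this is a basic machine learning question rather than something specific to Stack Overflow. I suggest reading this wiki on the topic; it may help clear things up.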
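To make the general idea concrete: one common use of a validation set is model selection. You fit several candidate models on the training data only, compare them on the validation data, keep the best one, and touch the test set just once at the very end. Below is a minimal sketch of that idea for a single label, tuning the regularization parameter C of LinearSVC. The helper name pick_best_C, the candidate grid c_grid, and the toy data are assumptions made for illustration; they are not taken from the question or the linked article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def pick_best_C(X_train, y_train, X_val, y_val, c_grid=(0.01, 0.1, 1.0, 10.0)):
    """Fit one pipeline per candidate C; keep the one with the best validation F1."""
    best_C, best_score, best_model = None, -1.0, None
    for C in c_grid:
        model = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", LinearSVC(C=C)),
        ])
        # Fit on the training data only; the validation data is never used for fitting.
        model.fit(X_train, y_train)
        score = f1_score(y_val, model.predict(X_val), average="micro")
        # Strict ">" keeps the first candidate when scores tie.
        if score > best_score:
            best_C, best_score, best_model = C, score, model
    return best_C, best_score, best_model


# Toy data (hypothetical) standing in for df_train.X / train[category]
# and df_validate.X / validation[category] from the question.
X_train = ["good movie", "great film", "bad movie", "awful film"]
y_train = [1, 1, 0, 0]
X_val = ["great movie", "awful movie"]
y_val = [1, 0]

best_C, best_score, best_model = pick_best_C(X_train, y_train, X_val, y_val)
print("best C: {}, validation F1: {:.3f}".format(best_C, best_score))
```

In the question's loop, this would be called once per category, and only best_model would then be scored on the test set. scikit-learn's GridSearchCV performs a similar search, but with cross-validation instead of a single fixed validation split.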
