How does accuracy differ between one-hot encoding and CountVectorizer on the same dataset?


onehot_enc, BernoulliNB:

Here I use two separate files, one for the reviews and one for the labels, and use `train_test_split` to randomly split the data into 80% training data and 20% test data.

reviews.txt:

Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear

labels.txt:

positive
negative
positive
negative

My Code:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix

with open("/Users/abc/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)


bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)

score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :", score)  # 90%

predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)

print("precision_score :", precision_score)  # 92%
print("recall_score :", recall_score)  # 97%

CountVectorizer, MultinomialNB:

Here I manually split the same data into training (80%) and test (20%) sets and feed the two CSV files to the algorithm.

However, this gives much lower accuracy than the approach above. Can anyone help me understand why?

train_data.csv:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

test_data.csv:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

My Code:

from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score


def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            # split on the last comma so commas inside a review don't break parsing
            review, label = line.strip().rsplit(',', 1)
            labels.append(label)
            reviews.append(review)

    return reviews, labels

X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')

vec = CountVectorizer() 

X_train_transformed =  vec.fit_transform(X_train) 

X_test_transformed = vec.transform(X_test)

clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)

score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)  # 46%

y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))

print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))  # 46%
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))  # 46%
1 Answer

The problem here is that you fit the MultiLabelBinarizer with

onehot_enc.fit(reviews_tokens)

before splitting into train and test sets, so the test data leaks into the model, which is why the accuracy is higher.
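A minimal sketch of the fix (reusing the four example reviews from the question): split first, then fit the binarizer only on the training tokens, so the test vocabulary never influences the encoder.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import BernoulliNB

reviews_tokens = [
    "Colors & clarity is superb".split(),
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung".split(),
    "The picture is clear and beautiful".split(),
    "Picture is not clear".split(),
]
labels = ["positive", "negative", "positive", "negative"]

# Split FIRST, so the test reviews never influence the vocabulary
X_train, X_test, y_train, y_test = train_test_split(
    reviews_tokens, labels, test_size=0.5, random_state=1)

# Fit the binarizer on the training tokens only
onehot_enc = MultiLabelBinarizer()
X_train_enc = onehot_enc.fit_transform(X_train)
# Tokens unseen during fit are ignored at transform time (sklearn emits a warning)
X_test_enc = onehot_enc.transform(X_test)

bnbc = BernoulliNB(binarize=None)
bnbc.fit(X_train_enc, y_train)
print("accuracy:", bnbc.score(X_test_enc, y_test))
```

With only four reviews the score is not meaningful, but the structure (split, then fit on the training portion) is what removes the leakage.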

With CountVectorizer, on the other hand, only the training data is seen during fitting, and words that do not appear in the training data are then ignored at transform time, even though they might be valuable to the classification model.

So depending on the amount of data you have, this can make a huge difference. In any case, your second technique (using CountVectorizer) is the correct one and is what should be used with text data. In general, MultiLabelBinarizer and one-hot encoding should only be used for categorical data, not for text data.
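To keep the vectorizer's vocabulary tied to the training data automatically, the CountVectorizer and the classifier can be wrapped in a scikit-learn Pipeline (a sketch using the four example reviews from the question):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = ["positive", "negative"]
X_test = ["The picture is clear and beautiful", "Picture is not clear"]
y_test = ["positive", "negative"]

# fit() runs fit_transform on the training texts only; score()/predict()
# run transform on the test texts, so leakage cannot happen by accident.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
```

The pipeline guarantees the same vocabulary-only-from-training behaviour as the manual `fit_transform`/`transform` calls in the question, with less room for error.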

Could you share your full data?
