How do I implement the Naive Bayes algorithm using "onehot_enc"?


I followed this site to apply the Naive Bayes algorithm to my dataset. The dataset is split into two files: review.txt and label.txt. I used the "train_test_split" function here.

My code:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix

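# splitlines() avoids the empty trailing entry that split("\n") yields when a file ends with a newline
with open("/Users/abc/review.txt") as f:
    reviews = f.read().splitlines()
with open("/Users/abc/label.txt") as f:
    labels = f.read().splitlines()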

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)


bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)

score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :" , score)

predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)

print("precision_score :" , precision_score)
print("recall_score :" , recall_score)

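However, my requirement now is to keep the dataset (review, label) in a single file, and to supply the training and test data separately by hand. I have changed the code accordingly.

But I cannot use "onehot_enc" here. It throws an error, because the reviews returned from the "load_data" function are a list of lists of words.

Can anyone suggest how I can use "onehot_enc" with my dataset in this code?

For this, I used the following data and code: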

train_data.csv:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

test_data.csv:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

New code (supplying reviews and labels in a single CSV file):

from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score


def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()
        for line in file:
            # Split on the last comma only, so commas inside a review are preserved
            review, label = line.strip().rsplit(',', 1)
            labels.append(label)
            reviews.append(review)

    return reviews, labels

X_train, y_train = load_data('/Users/abc/train_data.csv')
X_test, y_test = load_data('/Users/abc/test_data.csv')

Tags: python machine-learning scikit-learn
1 Answer

If I understand correctly, what you want is to label your reviews using Naive Bayes. One-hot encoding is meant for labels or categorical data.

You should apply it to your labels, so that you get 0 and 1 instead of "positive" and "negative", rather than to your reviews.
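As a rough illustration of encoding the labels, here is a minimal sketch that uses LabelBinarizer from scikit-learn. The label list is made up for the example and only mirrors the "positive"/"negative" values in the question's CSV files.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical labels, mirroring the values used in the question's CSV files
labels = ["positive", "negative", "negative", "positive"]

label_enc = LabelBinarizer()
# With two classes, fit_transform returns a single 0/1 column; ravel() flattens it
y = label_enc.fit_transform(labels).ravel()

print(label_enc.classes_)  # classes are stored in sorted order
print(y)                   # "negative" maps to 0, "positive" maps to 1
```

Note that scikit-learn classifiers also accept string labels directly, so this step is optional.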

For your text, sklearn has built-in functions that handle tokenization; CountVectorizer will usually work here.
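A minimal sketch of that approach, assuming the reviews are plain strings such as those returned by "load_data" in the question. The four sample rows are copied from the question's train_data.csv and test_data.csv; with so little data the scores are not meaningful, so this only shows how the pieces fit together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Sample rows copied from the question's CSV files (stand-ins for load_data output)
X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = ["positive", "negative"]
X_test = ["The picture is clear and beautiful", "Picture is not clear"]
y_test = ["positive", "negative"]

# Learn the vocabulary from the training reviews only, then reuse it for the test reviews
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
predicted_y = clf.predict(X_test_counts)

print("score:", clf.score(X_test_counts, y_test))
# pos_label tells the metric which string label counts as the positive class
print("precision:", precision_score(y_test, predicted_y, pos_label="positive"))
print("recall:", recall_score(y_test, predicted_y, pos_label="positive"))
```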

I suggest taking a look at the following link, which explains in detail how to work with text.
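If you would still like to keep "onehot_enc" from the question, one possible sketch is below. It assumes "load_data" returns one string per review, as the code in the question is written; MultiLabelBinarizer expects each sample to be an iterable of labels, so each review is first split into words, the same way "reviews_tokens" is built in the first script. The sample rows are copied from the question's CSV files.

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# Sample rows copied from the question's CSV files (stand-ins for load_data output)
X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = ["positive", "negative"]
X_test = ["The picture is clear and beautiful", "Picture is not clear"]
y_test = ["positive", "negative"]

# Turn each review string into a list of words
train_tokens = [review.split() for review in X_train]
test_tokens = [review.split() for review in X_test]

# Fit the encoder on every token list, as the question's first script does
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(train_tokens + test_tokens)

# binarize=None because the encoded features are already 0/1
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(train_tokens), y_train)

score = bnbc.score(onehot_enc.transform(test_tokens), y_test)
print("score of Naive Bayes algo is :", score)
```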
