I followed this site to apply the Naive Bayes algorithm to my dataset. The dataset is split into two files: review.txt and label.txt. I use the "train_test_split" function here.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
with open("/Users/abc/review.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/label.txt") as f:
    labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :" , score)
predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)
print("precision_score :" , precision_score)
print("recall_score :" , recall_score)
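Now, however, my requirement is to have the dataset (review, label) in a single file, and I need to supply the training and test data separately by hand. I changed the code accordingly.
However, I cannot use "onehot_enc" here. It throws an error because the reviews returned by the "load_data" function are a list of lists of words.
Can anyone suggest how to implement my code with "onehot_enc" for my dataset?
For this, I used the following data and code: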
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            # split on the last comma only, so commas inside a review are kept
            line = line.strip().rsplit(',', 1)
            labels.append(line[1])
            reviews.append(line[0])
    return reviews, labels
X_train, y_train = load_data('/Users/abc/train_data.csv')
X_test, y_test = load_data('/Users/abc/test_data.csv')
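If I understand correctly, you want to label your reviews using Naive Bayes. One-hot encoding is meant for labels or categorical data.
You should apply it to your labels, to get 0 and 1 instead of positive and negative, not to your reviews.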
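As a minimal sketch of encoding the labels rather than the reviews: the snippet below uses scikit-learn's LabelEncoder, which is my choice for illustration and not something the question's code uses. The toy labels are copied from the sample rows in the question.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels copied from the sample rows in the question.
y_train = ["positive", "negative"]
y_test = ["positive", "negative"]

label_enc = LabelEncoder()
# fit_transform learns the classes in sorted order, so
# "negative" maps to 0 and "positive" maps to 1.
y_train_encoded = label_enc.fit_transform(y_train)
# Reuse the same mapping for the test labels.
y_test_encoded = label_enc.transform(y_test)

print(label_enc.classes_.tolist())
print(y_train_encoded.tolist())
print(y_test_encoded.tolist())
```

Note that scikit-learn classifiers also accept string labels directly, so this step is optional; it mainly helps if you want numeric labels for your own bookkeeping.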
For your text, sklearn has built-in functions that handle tokenization; CountVectorizer would typically work here.
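A rough sketch of how that could look, assuming the reviews are plain strings as returned by load_data. The toy lists below stand in for the two CSV files and are copied from the sample rows in the question; with only two training rows the scores are not meaningful, this only shows the flow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Stand-ins for load_data('.../train_data.csv') and load_data('.../test_data.csv').
X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = ["positive", "negative"]
X_test = ["The picture is clear and beautiful", "Picture is not clear"]
y_test = ["positive", "negative"]

# CountVectorizer tokenizes the raw strings and builds word-count features.
# Fit the vocabulary on the training reviews only, then reuse it for the test set.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
predicted_y = clf.predict(X_test_counts)

print("score of Naive Bayes algo is :", clf.score(X_test_counts, y_test))
# pos_label tells the metric which string label counts as the positive class.
print("precision_score :",
      precision_score(y_test, predicted_y, pos_label="positive", zero_division=0))
print("recall_score :",
      recall_score(y_test, predicted_y, pos_label="positive", zero_division=0))
```

To stay closer to the original BernoulliNB setup, CountVectorizer(binary=True) produces 0/1 word-presence features instead of counts.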
I suggest looking at the following link, which explains in detail how to work with text.