Here I am using two separate files for the reviews and the labels, and I use `train_test_split` to randomly split the data into 80% training data and 20% test data.
Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear
positive
negative
positive
negative
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
with open("/Users/abc/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/labels.txt") as f:
    labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :", score)  # 90%
predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)
print("precision_score :", precision_score)  # 92%
print("recall_score :", recall_score)  # 97%
Here, I split the same data manually into train (80%) and test (20%) and fed the two CSV files to the algorithm.
However, this gives much lower accuracy than the approach above. Can anyone help me understand why the same data behaves so differently...
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the header row
        for line in file:
            # split on the last comma so a review containing commas stays intact
            review, label = line.strip().rsplit(',', 1)
            labels.append(label)
            reviews.append(review)
    return reviews, labels
X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf = MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)  # 46%
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test, y_pred))
print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))  # 46%
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))  # 46%
The problem here is that you fit the MultiLabelBinarizer on the whole corpus:

onehot_enc.fit(reviews_tokens)

before splitting into train and test, so the test data leaks into the model, which is why you get the higher accuracy.
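A minimal sketch of the fix, using toy stand-ins for the question's review/label files (the variable names mirror the question, the data is made up): fit the binarizer on the training split only, so the encoding never sees the test tokens.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for reviews.txt / labels.txt
reviews_tokens = [r.split() for r in [
    "picture is clear", "picture is not clear",
    "colors are superb", "picture is not bright",
]]
labels = ["positive", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    reviews_tokens, labels, test_size=0.25, random_state=1)

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)  # fit on the training split ONLY -- no leakage
X_train_enc = onehot_enc.transform(X_train)
# Tokens that appear only in the test set are ignored (with a warning)
X_test_enc = onehot_enc.transform(X_test)
```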
When you use CountVectorizer, on the other hand, it only sees the training data, and it then ignores any words that do not appear in the training data, even though those words might be valuable to the classification model.
So depending on how much data you have, this can make a huge difference. In any case, your second technique (using CountVectorizer) is the correct one and is what should be used with text data. In general, MultiLabelBinarizer and one-hot encoding should only be used for categorical data, not for text data.
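To see the unseen-word behaviour concretely, here is a small sketch (toy strings, not the question's data): at transform time, CountVectorizer silently drops every token that was not present in the data it was fitted on.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the picture is clear"])  # vocabulary: clear, is, picture, the
# Only "the" was seen during fit; "colors", "are", "superb" are dropped.
row = vec.transform(["the colors are superb"]).toarray()[0]
```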
Could you share your complete data?