I am currently working on a project in which I am trying to use Word2Vec together with Multinomial Naive Bayes (MultinomialNB) and then compute the accuracy.
import pandas as pd
import numpy as np, sys
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score
from datasets import load_dataset
df = load_dataset('celsowm/bbc_news_ptbr', split='train')
X = df['texto']
y = df['categoria']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
sentences = [sentence.split() for sentence in X_train]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)
X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
clf = MultinomialNB()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label='positive'))
However, I am getting an error:
ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to MultinomialNB (input X)
I would appreciate any insight into how to resolve this issue.
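From the scikit-learn documentation:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.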
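To illustrate the kind of input the documentation describes, here is a minimal sketch that feeds tf-idf features to MultinomialNB. The toy sentences and the labels `esporte` and `economia` are made up for this example and are not taken from the dataset in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real news texts.
train_texts = [
    "o time venceu o jogo",
    "o jogador marcou um gol",
    "a bolsa caiu hoje",
    "o banco subiu os juros",
]
train_labels = ["esporte", "esporte", "economia", "economia"]

# tf-idf weights are fractional but never negative, so MultinomialNB accepts them.
text_clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
text_clf.fit(train_texts, train_labels)

predicted = text_clf.predict(["o time marcou um gol"])
print(predicted[0])
```

Swapping `TfidfVectorizer` for `CountVectorizer` would give the integer word counts that the multinomial model was originally designed for.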
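You are passing word vectors as input. Word vectors are dense, real-valued embeddings, and their components are frequently negative, whereas MultinomialNB expects non-negative, count-like features. That is why fitting the model raises this error.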
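If you want to keep the averaged word vectors, one possible way out is a classifier that accepts real-valued features, such as GaussianNB. The sketch below uses synthetic vectors as a stand-in for the Word2Vec averages (the data, the labels, and the helper `make_samples` are assumptions made for illustration). It also shows a rescaling workaround with MinMaxScaler, which removes the error but treats embedding values as if they were counts, so its results may be poor.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)

# Hypothetical class centers; real ones would come from averaged Word2Vec vectors.
centers = {"esporte": rng.normal(size=100), "economia": rng.normal(size=100)}

def make_samples(n_per_class):
    """Draw dense float vectors (some components negative) around each center."""
    X = np.vstack([c + 0.5 * rng.normal(size=(n_per_class, 100))
                   for c in centers.values()])
    y = np.repeat(list(centers.keys()), n_per_class)
    return X, y

X_train, y_train = make_samples(50)
X_test, y_test = make_samples(10)

# 1) MultinomialNB refuses features that contain negative values.
try:
    MultinomialNB().fit(X_train, y_train)
    multinomial_error = None
except ValueError as exc:
    multinomial_error = str(exc)

# 2) GaussianNB models each feature as a real-valued Gaussian, so it fits as is.
gnb = GaussianNB().fit(X_train, y_train)
gnb_accuracy = gnb.score(X_test, y_test)

# 3) Workaround: squeeze every feature into [0, 1] before MultinomialNB.
#    clip=True keeps test values inside that range as well.
scaled_nb = make_pipeline(MinMaxScaler(clip=True), MultinomialNB())
scaled_nb.fit(X_train, y_train)
scaled_accuracy = scaled_nb.score(X_test, y_test)

print("MultinomialNB error:", multinomial_error)
print("GaussianNB accuracy:", gnb_accuracy)
print("Scaled MultinomialNB accuracy:", scaled_accuracy)
```

In your script this would amount to replacing `clf = MultinomialNB()` with `clf = GaussianNB()` and leaving the `vectorize` step unchanged.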