Negative values in data passed to MultinomialNB when vectorizing with Word2Vec


I am working on a project in which I try to combine Word2Vec with Multinomial Naive Bayes (MultinomialNB) and measure the resulting accuracy.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from gensim.models import Word2Vec
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score
from datasets import load_dataset

df = load_dataset('celsowm/bbc_news_ptbr', split='train')
X = df['texto']
y = df['categoria']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

sentences = [sentence.split() for sentence in X_train]
w2v_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

def vectorize(sentence):
    words = sentence.split()
    words_vecs = [w2v_model.wv[word] for word in words if word in w2v_model.wv]
    if len(words_vecs) == 0:
        return np.zeros(100)
    words_vecs = np.array(words_vecs)
    return words_vecs.mean(axis=0)

X_train = np.array([vectorize(sentence) for sentence in X_train])
X_test = np.array([vectorize(sentence) for sentence in X_test])
clf = MultinomialNB()
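# The averaged Word2Vec vectors contain negative values, so this call raises the ValueError shown below.
clf.fit(X_train, y_train)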

y_pred = clf.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))
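# Assuming 'categoria' holds several news categories rather than a binary 'positive' label, average precision over classes.
print('Precision:', precision_score(y_test, y_pred, average='macro'))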

However, I get an error:

ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to MultinomialNB (input X)

I would appreciate any insight into how to resolve this.

python scikit-learn gensim word2vec naivebayes
1 Answer

From the scikit-learn documentation:

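The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.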
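To illustrate the kind of input the documentation describes, here is a minimal sketch that feeds tf-idf features to MultinomialNB. The toy corpus and labels are hypothetical stand-ins, not the dataset from the question; the point is only that tf-idf values are never negative, so fit succeeds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the news texts and their categories.
texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the central bank raised interest rates",
    "inflation and interest rates worry markets",
]
labels = ["sport", "sport", "economy", "economy"]

# tf-idf produces non-negative fractional "counts", which MultinomialNB accepts.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

# Inspect the features: the smallest tf-idf value is never below zero.
tfidf = model.named_steps["tfidfvectorizer"].transform(texts)
print("min feature value:", tfidf.min())

prediction = model.predict(["goal in the football match"])
print("predicted category:", prediction[0])
```

In the question's code, this would mean passing the raw text columns to such a pipeline instead of the averaged Word2Vec vectors.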

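You are passing averaged word vectors as input. Word2Vec embeddings are dense, real-valued vectors whose components can be negative, so they are not counts or frequencies of anything. MultinomialNB rejects any negative feature value, which is why fit raises this error and why your approach does not work.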
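If you want to keep the Word2Vec features, two possible workarounds are sketched below. Neither comes from the scikit-learn quote above; they are suggestions under stated assumptions. The random matrix is a synthetic stand-in for the averaged word vectors, used so the example runs without downloading the dataset or training Word2Vec.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for averaged Word2Vec vectors: dense floats, some negative.
rng = np.random.default_rng(42)
y_train = np.array([0, 1] * 20)
X_train = rng.normal(size=(40, 100))
X_train[y_train == 1] += 1.0  # shift one class so the two are separable

# Reproduce the problem: MultinomialNB refuses negative feature values.
try:
    MultinomialNB().fit(X_train, y_train)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print("MultinomialNB on raw vectors:", error_message)

# Option 1: GaussianNB models each feature as a continuous variable,
# so real-valued (including negative) inputs are fine.
gnb = GaussianNB().fit(X_train, y_train)
print("GaussianNB train accuracy:", gnb.score(X_train, y_train))

# Option 2: rescale every feature to [0, 1] before MultinomialNB.
# This removes the error, but the values are still not counts.
scaled_mnb = make_pipeline(MinMaxScaler(), MultinomialNB()).fit(X_train, y_train)
print("Scaled MultinomialNB train accuracy:", scaled_mnb.score(X_train, y_train))
```

Rescaling only satisfies the non-negativity check; the multinomial model still treats each embedding dimension as if it were a count. For dense embeddings, GaussianNB (or a different classifier such as logistic regression) is usually the more natural choice.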
