Machine learning fake news detector


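I'm working on a project that requires me to build a program that takes user input and predicts whether it is real or fake. I created a notebook that trains and tests a model on the data, and it reached 90% accuracy. After exporting the model together with the vectorizer, I wrote some code for a main.py file that loads the model and the vectorizer, preprocesses the text from the user, converts the preprocessed input into a numerical representation using the TfidfVectorizer, makes a prediction on the input vector, and finally prints the result. I'd also like to add a feature that lets the user browse news outlets using the input, but that is for later. The problem is that no matter what I type, it returns true with a probability of 0.88+ every time, whether the input is fake news or lorem ipsum.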

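The output in square brackets shows the same percentages for "banana is yellow" and "banana is red". Can someone help me identify the problem?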

The files are on my GitHub, except for the exported model and the dataset (the model is about 500 MB and the dataset about 200 MB): https://github.com/mrmatix/fakeapp

I would greatly appreciate any help on this topic. Here is my main.py file:

import joblib
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import requests
from bs4 import BeautifulSoup
import webbrowser

# Load the XGB model
model = joblib.load('model/xgb.joblib')

# Load the TfidfVectorizer object used during training
vectorizer = joblib.load('vectorizer/tfidf.joblib')

webbrowser.BackgroundBrowser("C:/Program Files/Mozilla Firefox/firefox.exe")


def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove non-alphabetic characters
    text = re.sub('[^a-zA-Z]', ' ', text)

    # Tokenize the text
    words = text.split()

    # Join the words back into a string with space as separator
    clean_text = ' '.join(words)

    return clean_text

while True:
    print("Welcome to the Fake Reader App where you can check if a phrase is real or fake.")
    # Get user input
    user_input = input("Enter a phrase or 'quit' to exit: ")

    # DEBUG: echo the raw user input
    print(f"user_input = {user_input}")

    if user_input == 'quit':
        break

    # Preprocess the user input
    processed_input = preprocess_text(user_input)

    # Print the processed_input to check if the preprocessing is working correctly DEBUG
    print(f"processed_input = {processed_input}")

    # Convert the preprocessed input to a numerical representation using TfidfVectorizer
    input_vector = vectorizer.transform([processed_input])

    # Print the input_vector to check if the TfidfVectorizer is working correctly DEBUG
    print(f"input_vector = {input_vector}")

    # Make predictions on the input vector
    preds = model.predict(input_vector)

    # Print the preds to check the model prediction DEBUG
    print(f"preds = {preds}")

    print("printing the probability of the prediction")
    # Print the probability of the prediction
    print(model.predict_proba(input_vector))

    print("printing the prediction")
    # Print the prediction
    print(model.predict(input_vector))

    # Print the verdict; predict() returns a class label (0 or 1), not a probability
    if preds[0] == 0:
        print("The phrase is possibly fake.")
    else:
        print("The phrase is possibly real.")
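
One detail in the script that may be worth checking: `model.predict` returns a class label rather than a probability, so a confidence cutoff only has an effect when it is applied to the output of `predict_proba`. The following is a minimal sketch of that idea, not the project's actual code. The helper name `verdict_from_proba` is hypothetical, the probabilities are made up, and treating label 1 as "real" is an assumption taken from the branch order in the script.

```python
def verdict_from_proba(proba_row, classes, real_label=1, threshold=0.5):
    """Return (verdict, p_real) for one row of predict_proba output.

    proba_row:  probabilities for one sample, in the same order as `classes`
    classes:    class labels in column order (for example model.classes_)
    real_label: label that means "real"; 1 is an assumption here
    threshold:  minimum P(real) required to call the phrase real
    """
    classes = list(classes)
    # Pick the column that belongs to the "real" class instead of guessing it.
    p_real = float(proba_row[classes.index(real_label)])
    verdict = "possibly real" if p_real >= threshold else "possibly fake"
    return verdict, p_real


# Made-up probabilities for illustration (not output from the actual model).
print(verdict_from_proba([0.12, 0.88], classes=[0, 1]))
print(verdict_from_proba([0.12, 0.88], classes=[0, 1], threshold=0.98))
```

In main.py this could be called as `verdict_from_proba(model.predict_proba(input_vector)[0], model.classes_)`, assuming the loaded model exposes `classes_` the way scikit-learn classifiers do.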

Here is the dataset: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/

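Overall, I'm still very new to machine learning and AI, and I don't know whether the model is overfitted, overtrained, or something else. I'd be grateful if someone could lend a hand.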

python machine-learning classification naivebayes