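I'm working on a project that requires a program that takes user input and predicts whether it is real or fake. I created a notebook that trains and tests on the data, with 90% accuracy. After exporting the model along with the vectorizer, I wrote some code for a main.py file in which I load the model and the vectorizer, preprocess the text from the user, convert the preprocessed input to a numerical representation using TfidfVectorizer, make a prediction on the input vector, and finally print the result. I'd also like to add a feature that lets the user browse news outlets from the input, but that comes later...
The problem is that no matter what I type, it returns true, every time with 0.88+, whether it's fake news or lorem ipsum.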
"Bananas are yellow and red" gets the same percentage, shown in the square brackets. Can someone help me identify the issue?
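The files are on my GitHub, except for the exported model and the dataset (the model is about 500 MB and the dataset about 200 MB): https://github.com/mrmatix/fakeapp
I would really appreciate any help on this topic. Here is my main.py file: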
import joblib
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import requests
from bs4 import BeautifulSoup
import webbrowser
# Load the XGB model
model = joblib.load('model/xgb.joblib')
# Load the TfidfVectorizer object used during training
vectorizer = joblib.load('vectorizer/tfidf.joblib')
webbrowser.BackgroundBrowser("C:/Program Files/Mozilla Firefox/firefox.exe")
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove non-alphabetic characters
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Tokenize the text
    words = text.split()
    # Join the words back into a string with space as separator
    clean_text = ' '.join(words)
    return clean_text
while True:
    print("Welcome to the Fake Reader App where you can check if a phrase is real or fake.")
    # Get user input
    user_input = input("Enter a phrase or 'quit' to exit: ")
    # Add this line at the beginning of the while loop DEBUG
    print(f"user_input = {user_input}")
    if user_input == 'quit':
        break
    # Preprocess the user input
    processed_input = preprocess_text(user_input)
    # Print the processed_input to check if the preprocessing is working correctly DEBUG
    print(f"processed_input = {processed_input}")
    # Convert the preprocessed input to a numerical representation using TfidfVectorizer
    input_vector = vectorizer.transform([processed_input])
    # Print the input_vector to check if the TfidfVectorizer is working correctly DEBUG
    print(f"input_vector = {input_vector}")
    # Make predictions on the input vector
    preds = model.predict(input_vector)
    # Print the preds to check the model prediction DEBUG
    print(f"preds = {preds}")
    print("printing the probability of the prediction")
    # Print the probability of the prediction
    print(model.predict_proba(input_vector))
    print("printing the prediction")
    # Print the prediction
    print(model.predict(input_vector))
    # Print the prediction
    if preds[0] < 0.98:
        print("The phrase is possibly fake.")
    else:
        print("The phrase is possibly real.")
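One check that might help narrow this down is to count how many tokens of the processed input the vectorizer actually knows. If none of them are in its vocabulary, `vectorizer.transform` returns an all-zero row, and every such input gets the same probability from the model. The sketch below is not taken from the repository: `vocabulary_coverage` and `toy_vocabulary` are hypothetical names, and it assumes the vectorizer was fitted with scikit-learn's default token pattern and no stop-word list.

```python
import re

# Assumption: the vectorizer uses scikit-learn's default token pattern,
# which keeps tokens of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary_coverage(processed_text, vocabulary):
    """Split the tokens of a text into those known and unknown to a vocabulary.

    `vocabulary` is any mapping or set of terms, for example the
    `vocabulary_` attribute of a fitted TfidfVectorizer.
    """
    tokens = TOKEN_PATTERN.findall(processed_text.lower())
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


# Toy vocabulary standing in for vectorizer.vocabulary_ (hypothetical values).
toy_vocabulary = {"president": 0, "election": 1, "said": 2}

known, unknown = vocabulary_coverage("lorem ipsum dolor sit amet", toy_vocabulary)
print(f"known = {known}, unknown = {unknown}")
```

In main.py this could be called as `vocabulary_coverage(processed_input, vectorizer.vocabulary_)` right before `vectorizer.transform`. An empty `known` list for most inputs would point at the vectorizer or the preprocessing rather than at the model.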
Here is the dataset: https://www.kaggle.com/datasets/saurabhshahane/fake-news-classification/
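Since `model.predict` returns a class label (0 or 1) rather than a probability, comparing `preds[0]` with 0.98 only separates label 0 from label 1. A clearer way to read the output may be to pair each `predict_proba` column with its class label. The sketch below uses plain lists in place of the real model output. `label_probabilities` is a hypothetical helper, and the mapping of 0 to fake and 1 to real is an assumption that should be checked against the dataset page.

```python
def label_probabilities(classes, proba_row, label_names):
    """Pair each predict_proba column with a readable class name."""
    return {label_names[c]: float(p) for c, p in zip(classes, proba_row)}


# Hypothetical values: `classes` stands in for model.classes_ and
# `proba_row` for model.predict_proba(input_vector)[0].
classes = [0, 1]
proba_row = [0.12, 0.88]

# Assumption: 0 means fake and 1 means real; verify this against the dataset.
label_names = {0: "fake", 1: "real"}

probs = label_probabilities(classes, proba_row, label_names)
verdict = max(probs, key=probs.get)  # name of the most probable class
print(f"probabilities = {probs}, verdict = {verdict}")
```

With the real model, the call would look like `label_probabilities(model.classes_, model.predict_proba(input_vector)[0], label_names)`, assuming the loaded model exposes `classes_` the way scikit-learn classifiers do.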
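I'm fairly new to machine learning and AI in general, and I don't know whether the model is overfitted, overtrained, or something else. If anyone can lend a hand, I'd really appreciate it.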
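A rough way to look into the overfitting question is to compare accuracy on the training split with accuracy on the test split, and to compare both with the accuracy of always predicting the most common label. A large gap between train and test would suggest overfitting, while a test score close to the baseline would suggest the model mostly predicts one class. The sketch below uses toy labels, not results from the notebook. `accuracy` and `majority_baseline` are hypothetical helpers.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("y_true must not be empty")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Toy labels and predictions (hypothetical). In the notebook these would be
# y_train, model.predict(X_train), y_test and model.predict(X_test).
y_train, pred_train = [0, 1, 1, 0, 1, 0], [0, 1, 1, 0, 1, 0]
y_test, pred_test = [1, 0, 1, 1], [1, 1, 1, 1]

train_acc = accuracy(y_train, pred_train)
test_acc = accuracy(y_test, pred_test)
baseline = majority_baseline(y_test)
print(f"train = {train_acc:.2f}, test = {test_acc:.2f}, baseline = {baseline:.2f}")
```

In this toy example the model always predicts 1 on the test split, so its test accuracy equals the majority baseline even though the training accuracy is perfect.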