Why does my logistic regression model keep predicting the same thing?


https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset

I am using this Kaggle dataset as my diabetes dataset and trying to build a LogisticRegression model to predict the Outcome.

I created the following class:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import KFold
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class diabetesLogReg:
    df = pd.read_csv("/Users/aahan_bagga/Desktop/diabetes_data.csv")
    X=df.drop(["Outcome"], axis=1)
    Y=df["Outcome"]
    preg = 0
    glucose = 0
    BP = 0
    skinThickness = 0
    insulin = 0
    bmi = 0
    diabetesPedigreeFunction = 0
    age = 0
    def __init__(self, p, g, BP, ST, I, BMI, DPF, age):
        self.preg = p
        self.glucose = g
        self.BP = BP
        self.skinThickness = ST
        self.insulin = I
        self.bmi = BMI
        self.diabetesPedigreeFunction = DPF
        self.age = age

    def preprocessing(self):
        global Y_train
        global Y_test
        #K-fold cross validation
        kf = KFold(n_splits = 9, shuffle = True, random_state = 19)

        global X_train, X_test, Y_train, Y_test
        for training_index, testing_index in kf.split(self.X):
            X_train, X_test = self.X.iloc[training_index], self.X.iloc[testing_index]
            Y_train, Y_test = self.Y.iloc[training_index], self.Y.iloc[testing_index]


        #Normalization marginally better than Standardization
        scaler = MinMaxScaler()
        global x_train_s, x_test_s
        x_train_s = scaler.fit_transform(X_train)
        x_test_s = scaler.transform(X_test)

    def train(self):
        global model
        model = LogisticRegression(max_iter = 2000)
        model.fit(x_train_s,Y_train)
        y_pred = model.predict(x_test_s)
        return f"{accuracy_score(Y_test, y_pred) * 100}%"
    
        # TUNE HYPERPARAMETERS HERE
    
    def diabetes_pred(self):
        prob = model.predict_proba([[self.preg, self.glucose, self.BP, self.skinThickness, self.insulin, self.bmi, self.diabetesPedigreeFunction, self.age]])
        print(prob)
        if prob[0,1] > 0.5:
            return "Diabetes"
        else:
            return "No Diabetes"
    
    #def decision_boundary_graph():
        #
    


d = diabetesLogReg(2,126,45,23,340,30,0.12,29)

d.preprocessing()
print(d.train())
print(d.diabetes_pred())

Repeated output: 80.0% [[0. 1.]] Diabetes

It keeps outputting a "Diabetes" result for every prediction it makes. I'm new to machine learning, but I know I haven't tuned my hyperparameters yet. Does this have to do with the length of the dataset, i.e. is it too short? Or could it be related to my k-fold cross-validation?

It would be great if someone could take a look and help.

Thanks!

python machine-learning data-science computer-science
1 Answer

In this case, I don't think the problem is the dataset, since it has a usability rating of 10.0 on Kaggle. The only thing in your code that stands out to me is the 2000 iterations you train the model with, which might be a bit too many and could cause the model to overfit.

If you don't want to get into hyperparameter tuning just yet, try lowering the number of iterations yourself at a few different intervals and see how the model's behaviour changes with different max_iter values.
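
As an illustration, here is a minimal sketch of that manual approach. It assumes the scaled arrays x_train_s, x_test_s, Y_train and Y_test produced by the preprocessing() method above are already in scope, and the particular max_iter values are arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit the same model with a handful of max_iter values and compare test accuracy.
for n_iter in (100, 200, 500, 1000, 2000):  # intervals picked arbitrarily for illustration
    m = LogisticRegression(max_iter=n_iter)
    m.fit(x_train_s, Y_train)
    acc = accuracy_score(Y_test, m.predict(x_test_s))
    print(f"max_iter={n_iter}: accuracy = {acc * 100:.1f}%")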

If you want to find the best number of iterations, try using GridSearchCV from sklearn (docs) on the max_iter hyperparameter so that it finds the optimal number of iterations for training the model (see the sketch below).

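A minimal sketch of that idea, again assuming the scaled training data x_train_s and Y_train from the preprocessing step are available; the candidate max_iter values are only examples:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Cross-validated search over a small grid of max_iter values.
param_grid = {"max_iter": [100, 250, 500, 1000, 2000]}
search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
search.fit(x_train_s, Y_train)

print(search.best_params_)  # max_iter value with the best mean CV accuracy
print(search.best_score_)   # the corresponding mean cross-validated accuracy
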
Hope this helps!
