https://www.kaggle.com/datasets/akshaydattatraykhare/diabetes-dataset
I'm using this Kaggle dataset as my diabetes dataset and trying to build a LogisticRegression model to predict the outcome.
I created the following class:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import KFold
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

class diabetesLogReg:
    df = pd.read_csv("/Users/aahan_bagga/Desktop/diabetes_data.csv")
    X = df.drop(["Outcome"], axis=1)
    Y = df["Outcome"]
    preg = 0
    glucose = 0
    BP = 0
    skinThickness = 0
    insulin = 0
    bmi = 0
    diabetesPedigreeFunction = 0
    age = 0

    def __init__(self, p, g, BP, ST, I, BMI, DPF, age):
        self.preg = p
        self.glucose = g
        self.BP = BP
        self.skinThickness = ST
        self.insulin = I
        self.bmi = BMI
        self.diabetesPedigreeFunction = DPF
        self.age = age

    def preprocessing(self):
        # K-fold cross validation
        kf = KFold(n_splits=9, shuffle=True, random_state=19)
        global X_train, X_test, Y_train, Y_test
        for training_index, testing_index in kf.split(self.X):
            X_train, X_test = self.X.iloc[training_index], self.X.iloc[testing_index]
            Y_train, Y_test = self.Y.iloc[training_index], self.Y.iloc[testing_index]
        # Normalization marginally better than Standardization
        scaler = MinMaxScaler()
        global x_train_s, x_test_s
        x_train_s = scaler.fit_transform(X_train)
        x_test_s = scaler.transform(X_test)

    def train(self):
        global model
        model = LogisticRegression(max_iter=2000)
        model.fit(x_train_s, Y_train)
        y_pred = model.predict(x_test_s)
        return f"{accuracy_score(Y_test, y_pred) * 100}%"
        # TUNE HYPERPARAMETERS HERE

    def diabetes_pred(self):
        prob = model.predict_proba([[self.preg, self.glucose, self.BP, self.skinThickness, self.insulin, self.bmi, self.diabetesPedigreeFunction, self.age]])
        print(prob)
        if prob[0, 1] > 0.5:
            return "Diabetes"
        else:
            return "No Diabetes"

    # def decision_boundary_graph():
    #

d = diabetesLogReg(2, 126, 45, 23, 340, 30, 0.12, 29)
d.preprocessing()
print(d.train())
print(d.diabetes_pred())
Repeated output: 80.0% [[0. 1.]] Diabetes
It keeps outputting "Diabetes" for every prediction it makes. I'm new to machine learning, but I do know I haven't tuned my hyperparameters yet. Could this be related to the length of the dataset, i.e. is it too short? Or maybe it has to do with my k-fold cross-validation?
If anyone could take a look and help out, that would be great.
Thanks!
In this case, I don't think the problem is the dataset, since it has a usability rating of 10.0 on Kaggle. The only thing that stands out to me in your code is the 2000 iterations you train the model with, which may be a bit too many and could cause the model to overfit.
If you don't want to get into hyperparameter tuning just yet, try lowering the number of iterations yourself at different intervals and see how the model's behavior changes with different max_iter values.
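A minimal sketch of that manual sweep, using a synthetic dataset from make_classification as a stand-in for the Kaggle CSV (the data, sizes, and the set of max_iter values tried are all assumptions, not from the original post):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.metrics import accuracy_score

    # Hypothetical stand-in for the diabetes CSV: 768 rows, 8 features
    X, y = make_classification(n_samples=768, n_features=8, random_state=19)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

    # Same normalization the question uses
    scaler = MinMaxScaler()
    X_train_s = scaler.fit_transform(X_train)
    X_test_s = scaler.transform(X_test)

    # Sweep max_iter at different intervals and watch how accuracy changes
    for max_iter in (50, 100, 500, 1000, 2000):
        model = LogisticRegression(max_iter=max_iter).fit(X_train_s, y_train)
        acc = accuracy_score(y_test, model.predict(X_test_s))
        print(f"max_iter={max_iter}: accuracy={acc:.3f}")

Note that max_iter is only an upper bound on solver iterations; once the solver converges, raising it further has no effect on the fitted model.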
If you want to find the optimal number of iterations, try using GridSearchCV from sklearn (docs) on the max_iter hyperparameter to find the best number of training iterations for your model.
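A sketch of that GridSearchCV approach, again on synthetic stand-in data (the grid values are assumptions; I've also included the regularization strength C, which in practice tends to matter more for logistic regression than max_iter):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # Hypothetical stand-in for the diabetes CSV
    X, y = make_classification(n_samples=768, n_features=8, random_state=19)

    # Candidate hyperparameter values to search over (example values)
    param_grid = {
        "max_iter": [100, 500, 1000, 2000],
        "C": [0.01, 0.1, 1, 10],
    }

    # 5-fold cross-validated grid search, scored on accuracy
    search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring="accuracy")
    search.fit(X, y)
    print(search.best_params_)
    print(search.best_score_)

search.best_estimator_ then gives you a model refit on all the data with the winning parameters.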
Hope this helps!