RidgeClassifier gives 99% accuracy


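This is one of my first fully independent Python data science projects. I am using a dataset I collected myself to predict the outcome of NBA games (whether the home team wins or loses). My model gets about 98% accuracy on the training data and as high as 99% on the test data, so I am sure something is wrong; I just can't work out what.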

import pandas as pd  # needed for pd.Series / pd.DataFrame further down
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import RidgeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import accuracy_score

# Time-ordered CV folds: each fold trains on earlier rows and validates on later ones
time_split = TimeSeriesSplit(n_splits=3)
ridge_model = RidgeClassifier(alpha=1)
# Forward selection of 15 features, scored on the time-series folds
feature_select = SequentialFeatureSelector(ridge_model, n_features_to_select=15, direction='forward', cv=time_split)

remove = ['MATCHID', 'DATE', 'SEASON', 'HOME', 'AWAY', 'W_HOME']
keep = df.columns[~df.columns.isin(remove)]

scale = MinMaxScaler()
df[keep] = scale.fit_transform(df[keep])
target = df['W_HOME']  # must exist before the selector is fitted
feature_select.fit(df[keep], target)

preds = list(keep[feature_select.get_support()])
preds # 15 features we will use to predict if the home team won or lost

['AWAY_FG3_PCT', 'HOME_FT_PCT', 'AWAY_FT_PCT', 'HOME_OFF_REB', 'AWAY_OFF_REB', 'HOME_DEF_REB', 'HOME_AST', 'AWAY_AST', 'AWAY_STL', 'HOME_TURNOVERS', 'AWAY_TURNOVERS', 'HOME_BLK', 'HOME_PTS', 'AWAY_PTS', 'AWAY_LAST5']

features = df[preds]
target = df['W_HOME']

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, shuffle=False, stratify=None)

model = ridge_model.fit(X_train, y_train)
train_score = model.score(X_train, y_train)
train_score

0.9894758146124266

predictions = model.predict(X_test)
predictions = pd.Series(predictions, index = y_test.index)

pred_table = pd.DataFrame(columns = ['PREDICTIONS', 'ACTUAL'])
pred_table['PREDICTIONS'] = predictions
pred_table['ACTUAL'] = y_test
accuracy_score(pred_table['ACTUAL'], pred_table['PREDICTIONS'])  # y_true first, then y_pred

0.9948064211520302
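One thing I notice in the code above is that MinMaxScaler and SequentialFeatureSelector are both fitted on the whole DataFrame before the train/test split, so the test rows influence the preprocessing. Below is a minimal sketch of keeping the scaler inside a pipeline that is fitted on the training rows only. The synthetic X and y are an assumption that stands in for my real dataset, and this change alone may well not explain the 99% score.

```python
import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (hypothetical, NOT the NBA dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Same kind of split as in the question: no shuffling, last 30% is the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, shuffle=False
)

# The scaler is fitted inside the pipeline, so it only ever sees X_train
pipe = make_pipeline(MinMaxScaler(), RidgeClassifier(alpha=1))
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
print(round(test_accuracy, 3))
```

The same pipeline object could also be passed to SequentialFeatureSelector as the estimator, so that scaling is refitted inside every cross-validation fold.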

I can also link the rest of the code and the dataset if needed.

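So far I have checked that X_train, X_test, y_train and y_test really are different, to make sure I am not just testing on a subset of the training data, and as far as I can tell I am not.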
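Another check I can think of: HOME_PTS and AWAY_PTS are among the selected features. If those columns hold the final score of the game being predicted (an assumption; they could also be pre-game averages), then the label is already contained in the features. Here is a rough sketch of how that could be tested. The helper name score_rule_agreement and the toy rows are made up for illustration, and on the real data the check would have to run on the raw values, before MinMaxScaler rescales each column separately.

```python
import pandas as pd


def score_rule_agreement(df, home_col="HOME_PTS", away_col="AWAY_PTS",
                         target_col="W_HOME"):
    """Fraction of rows where 'home scored more than away' equals the label."""
    rule = (df[home_col] > df[away_col]).astype(int)
    return float((rule == df[target_col].astype(int)).mean())


# Hypothetical toy rows, NOT taken from the real dataset
toy = pd.DataFrame({
    "HOME_PTS": [110, 98, 120, 101],
    "AWAY_PTS": [102, 105, 99, 115],
    "W_HOME":   [1, 0, 1, 0],
})

agreement = score_rule_agreement(toy)
print(agreement)
```

An agreement close to 1.0 on the real, unscaled data would suggest the model is reading the result from the box score instead of predicting it.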

data-science classification data-modeling modeling