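I'm working on a personal project in which I try to predict baseball batting and pitching statistics for every player in my dataset from 1970 to 2022. For simplicity, and to reduce potential confusion, I'll refer only to my batting dataset. After cleaning, it is 26768 rows × 33 columns.
I wanted to force myself to learn something new, so I decided to use an RNN model.
Project goal: predict 5 statistics for each player, starting from the 3rd season and continuing through the last season.
A sneak peek at the error: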
ValueError: cannot reshape array of size 36630 into shape (1,33,20)
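First, I'll provide some background to help with reviewing my problem.
I used sequential feature selection with ridge regression to get my predictors for each statistic: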
rr = Ridge(alpha=1)
split = TimeSeriesSplit(n_splits=3)
bat_sfs_ba = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_rbi = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_hr = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_bb = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
bat_sfs_so = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1)
Scaling the data:
scaler = MinMaxScaler()
batting.loc[:, bat_cols] = scaler.fit_transform(batting[bat_cols])
pitching.loc[:, pitch_cols] = scaler.fit_transform(pitching[pitch_cols])
Fitting the data:
bat_sfs_ba.fit(batting[bat_cols], batting['Nxt_BA'])
bat_sfs_rbi.fit(batting[bat_cols], batting['Nxt_RBI'])
bat_sfs_hr.fit(batting[bat_cols], batting['Nxt_HR'])
bat_sfs_bb.fit(batting[bat_cols], batting['Nxt_BB'])
bat_sfs_so.fit(batting[bat_cols], batting['Nxt_SO'])
Getting the lists of predictors:
bat_ba_preds = list(bat_cols[bat_sfs_ba.get_support()])
bat_rbi_preds = list(bat_cols[bat_sfs_rbi.get_support()])
bat_hr_preds = list(bat_cols[bat_sfs_hr.get_support()])
bat_bb_preds = list(bat_cols[bat_sfs_bb.get_support()])
bat_so_preds = list(bat_cols[bat_sfs_so.get_support()])
Example of my predictors for batting average:
['Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'CS', 'BB', 'SO', 'OBP', 'OPS', 'TB', 'GDP', 'SH', 'SF', 'IBB']
Importing the model:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
Building the multivariate time-series LSTM model inside a function:
def bat_ba_mrnn (data, model, predictors, start=2, step=1):
    bat_preds = []
    seasons = sorted(data["Year"].unique())
    for i in range(start, len(seasons), step):
        current_season = seasons[i]
        train = data[data['Year'] < current_season]
        test = data[data['Year'] == current_season]

        model = Sequential()

        train = train.values.reshape(1, 33, 20)

        model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
        model.add(Dropout(0.25))
        model.add(LSTM(units = 142, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 125, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 100, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences = True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences = False))
        model.add(Dense(units = 1))
        model.compile(optimizer = 'adam', loss = 'mean_squared_error')
        model.fit(train[predictors], train['Nxt_BA'])
        preds = model.predict(test[predictors])
        preds = pd.Series(preds, index=test.index)
        together = pd.concat([test['Nxt_BA'], preds], axis=1)
        together.columns = ['actual', 'prediction']
        bat_preds.append(together)
    return pd.concat(bat_preds)
I initially got an error saying the input was 2-dimensional when 3 dimensions were expected, so I reshaped it as shown above. Now, when I run:
bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)
it gives me this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [160], in <cell line: 1>()
----> 1 bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)
Input In [159], in bat_ba_mrnn(data, model, predictors, start, step)
13 test = data[data['Year'] == current_season]
15 model = Sequential()
---> 17 train = train.values.reshape(1, 33, 20)
19 model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
20 model.add(Dropout(0.25))
ValueError: cannot reshape array of size 36630 into shape (1,33,20)
I've tried many different options, but I haven't been able to figure out how to reshape it correctly so that it works.
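For reference, the arithmetic behind the error can be checked directly. `reshape` only succeeds when the product of the target dimensions equals the number of elements in the array. Here 1 × 33 × 20 = 660, while 36630 = 1110 × 33, which suggests (this is an inference from the error message, not something verified against the real data) that the first training slice has 1110 rows and all 33 columns. A minimal sketch using a dummy array of the same size:

```python
import numpy as np

# Dummy stand-in for train.values: 1110 rows x 33 columns = 36630 elements.
# The row count is inferred from the error message, not read from the real data.
dummy = np.zeros((1110, 33))

target = (1, 33, 20)
print(dummy.size)            # 36630
print(int(np.prod(target)))  # 660

# The element counts differ, so this reshape raises ValueError.
try:
    dummy.reshape(target)
    reshape_ok = True
except ValueError as err:
    reshape_ok = False
    print(err)

# A reshape that keeps every element does succeed, for example treating the
# whole slice as one sample with 1110 timesteps of 33 features each.
as_one_sequence = dummy.reshape(1, 1110, 33)
print(as_one_sequence.shape)  # (1, 1110, 33)
```

Whether a single long sequence is a sensible layout for this problem is a separate question; the sketch only shows why (1, 33, 20) cannot fit.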
=====
Update:
After looking around, I believe padding the array with zeros might solve my reshape problem, so after some research I added:
zeros = np.zeros((2,20))
zeros[:train.shape[0],:train.shape[1]] = train
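I also changed the first LSTM layer to the line below and removed the reshape line, because after additional research I found that I don't have to convert the data to 3 dimensions and can keep it 2-dimensional... if I understand correctly: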
model.add(LSTM(units = 175, return_sequences = True, input_shape = (33,20)))
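A note on the 2-dimensional point, in case my understanding is off: as far as I can tell from the Keras documentation, `input_shape` describes a single sample as (timesteps, features) and leaves out the batch dimension, so the array passed to `fit` is still expected to be 3-dimensional, (samples, timesteps, features). Below is a minimal NumPy-only sketch of how 2-D rows could be stacked into that layout. The helper `make_windows`, the window length of 3, and the random data are illustrative assumptions, not part of my actual pipeline; for real player data the windows would presumably need to be built per player rather than over the whole table.

```python
import numpy as np

def make_windows(features, targets, timesteps):
    """Stack consecutive rows into overlapping windows of length `timesteps`.

    Hypothetical helper: each window is paired with the target of its last row.
    """
    X, y = [], []
    for end in range(timesteps, len(features) + 1):
        X.append(features[end - timesteps:end])
        y.append(targets[end - 1])
    return np.array(X), np.array(y)

# Toy data: 10 rows with 20 features each, plus one target value per row.
rng = np.random.default_rng(0)
features = rng.random((10, 20))
targets = rng.random(10)

X, y = make_windows(features, targets, timesteps=3)
print(X.shape)  # (8, 3, 20) -> (samples, timesteps, features)
print(y.shape)  # (8,)
```

With data shaped like this toy example, the matching first-layer argument would be `input_shape = (3, 20)`.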
The change seems to have helped, since I no longer get the reshape error, but now I get the following error:
ValueError: could not convert string to float: 'Alan\xa0Foster'
It now seems to be including the string columns in this computation. I tried dropping all the string columns, but then I got this error:
IndexError: tuple index out of range
I'm not sure how to get past all of these errors.
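For the string-to-float error specifically, one approach that might work is to keep only the numeric columns before converting the frame to an array. This is a minimal sketch on a made-up two-row frame: the second player name and all the numbers are invented for illustration, and only the column names `Year` and `Nxt_BA` and the name 'Alan\xa0Foster' come from the post above. It does not address the later IndexError.

```python
import pandas as pd

# Toy frame with one string column; values are invented for illustration.
# '\xa0' is a non-breaking space, as seen in the error message.
toy = pd.DataFrame({
    "Player": ["Alan\xa0Foster", "Another\xa0Player"],
    "Year": [1970, 1971],
    "BA": [0.25, 0.31],
    "Nxt_BA": [0.27, 0.29],
})

# Keep only numeric columns so the conversion to a float array cannot hit a string.
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))  # ['Year', 'BA', 'Nxt_BA']

as_float = numeric.to_numpy(dtype="float32")
print(as_float.shape)  # (2, 3)
```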