Multivariate time series RNN (LSTM) problem for player stat prediction


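I'm working on a custom project in which I'm trying to predict baseball batting and pitching stats for all players in my dataset, which runs from 1970 to 2022. For simplicity, and to reduce potential confusion, I'll refer only to my batting dataset. After cleaning, it is 26768 rows × 33 columns.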

I wanted to force myself to learn something new, so I decided to use an RNN model.

Project goal: predict 5 stats for each player, from the 3rd season in the league through the last season.

Sneak peek:

ValueError: cannot reshape array of size 36630 into shape (1,33,20)

First, I'll provide some background to help with reviewing my problem.

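I used sequential feature selection with ridge regression to get my predictors for each stat:

from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit
from sklearn.feature_selection import SequentialFeatureSelector

rr = Ridge(alpha=1)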

split = TimeSeriesSplit(n_splits=3)

bat_sfs_ba = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_rbi = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_hr = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_bb = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 
bat_sfs_so = SequentialFeatureSelector(rr, n_features_to_select=20, cv=split, n_jobs= -1) 

Scaling the data:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
batting.loc[:, bat_cols] = scaler.fit_transform(batting[bat_cols])
pitching.loc[:, pitch_cols] = scaler.fit_transform(pitching[pitch_cols])

Fitting the data:

bat_sfs_ba.fit(batting[bat_cols], batting['Nxt_BA'])
bat_sfs_rbi.fit(batting[bat_cols], batting['Nxt_RBI'])
bat_sfs_hr.fit(batting[bat_cols], batting['Nxt_HR'])
bat_sfs_bb.fit(batting[bat_cols], batting['Nxt_BB'])
bat_sfs_so.fit(batting[bat_cols], batting['Nxt_SO'])

Getting the lists of predictors:

bat_ba_preds = list(bat_cols[bat_sfs_ba.get_support()])
bat_rbi_preds = list(bat_cols[bat_sfs_rbi.get_support()])
bat_hr_preds = list(bat_cols[bat_sfs_hr.get_support()])
bat_bb_preds = list(bat_cols[bat_sfs_bb.get_support()])
bat_so_preds = list(bat_cols[bat_sfs_so.get_support()])

Example of my batting average predictors:

['Age','G','PA','AB','R','H','2B','3B','HR','RBI','CS','BB', 'SO','OBP','OPS','TB','GDP','SH','SF','IBB']

Importing the model:

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

Building the multivariate time series LSTM model inside a function:

def bat_ba_mrnn (data, model, predictors, start=2, step=1):
    bat_preds = []
    
    seasons = sorted(data["Year"].unique())
    
    for i in range(start, len(seasons), step):
        current_season = seasons[i]
        train = data[data['Year'] < current_season]
        test = data[data['Year'] == current_season]
        
        model = Sequential()
        
        train = train.values.reshape(1, 33, 20)
        
        model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
        model.add(Dropout(0.25))
        model.add(LSTM(units = 142, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 125, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 100, return_sequences = True, dropout=0.2,recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 75, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= True, dropout=0.2, recurrent_dropout=0.15))
        model.add(LSTM(units = 50, return_sequences= False))
        model.add(Dense(units = 1))             
        
        
        model.compile(optimizer = 'adam', loss = 'mean_squared_error')
        model.fit(train[predictors], train['Nxt_BA'])
        
        preds = model.predict(test[predictors]) 
        preds = pd.Series(preds, index=test.index)
        together = pd.concat([test['Nxt_BA'], preds], axis=1)
        together.columns = ['actual', 'prediction']
        
        bat_preds.append(together)
    return pd.concat(bat_preds)

I originally got an error about receiving a 2-dim shape when 3 dims were expected, so I reshaped the data as shown above. Now when I run it:

bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)

It gives me this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [160], in <cell line: 1>()
----> 1 bat_ba_predictions = bat_ba_mrnn(batting, LSTM, bat_ba_preds)

Input In [159], in bat_ba_mrnn(data, model, predictors, start, step)
     13 test = data[data['Year'] == current_season]
     15 model = Sequential()
---> 17 train = train.values.reshape(1, 33, 20)
     19 model.add(LSTM(units = 175, return_sequences = True, input_shape = (train)))
     20 model.add(Dropout(0.25))

ValueError: cannot reshape array of size 36630 into shape (1,33,20)

I've tried a lot of different options but haven't been able to figure out how to reshape it correctly so that it works.
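For reference, a reshape only succeeds when the element counts match. The size 36630 in the error is consistent with 1110 rows × 33 columns (the row count is inferred from the numbers, it is not stated above), whereas a (1, 33, 20) array holds only 660 values. A small sketch using a zero-filled stand-in array rather than the real data:

```python
import numpy as np

# Stand-in for train.values on the first iteration: 36630 = 1110 * 33.
# The row count is inferred from the error message, not from the real data.
train_values = np.zeros((1110, 33))

target_shape = (1, 33, 20)
needed = int(np.prod(target_shape))   # 1 * 33 * 20
available = train_values.size         # 1110 * 33


def can_reshape(array, shape):
    """Return True when the array has exactly as many elements as the shape."""
    return array.size == int(np.prod(shape))


print(needed, available, can_reshape(train_values, target_shape))
```

Because 36630 is not a multiple of 660, no batch size would make (batch, 33, 20) fit this array either.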

=====

Update:

After looking around, I believe padding the array with zeros might solve my reshape problem, so after some research I added:

    zeros = np.zeros((2,20))
    zeros[:train.shape[0],:train.shape[1]] = train
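One common way to use zero padding with sequence models is to pad each player's history to a fixed number of timesteps, rather than padding the whole training table. The sketch below assumes each player's data is a 2D array of shape (n_seasons, n_features); the function name `pad_sequence` and the choice to pad at the front are illustrative assumptions, not something taken from the code above.

```python
import numpy as np


def pad_sequence(seq, timesteps):
    """Left-pad with zeros (or keep only the most recent rows) so that a
    2D array of shape (n_seasons, n_features) has exactly `timesteps` rows."""
    seq = np.asarray(seq, dtype=float)
    n_features = seq.shape[1]
    padded = np.zeros((timesteps, n_features))
    recent = seq[-timesteps:]                    # at most `timesteps` rows
    padded[timesteps - len(recent):] = recent    # real data sits at the end
    return padded


# Hypothetical player with 2 seasons of 20 features each.
player = np.ones((2, 20))
padded = pad_sequence(player, timesteps=5)
print(padded.shape)
```

Keras also provides a `Masking` layer that can tell downstream LSTM layers to skip all-zero timesteps, which may be worth a look if padding is used.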

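I also changed the first LSTM layer to the following and removed the reshape line, because after additional research I found that I don't have to convert the data to 3 dims and can keep it as 2 dims... if I understand correctly: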

    model.add(LSTM(units = 175, return_sequences = True, input_shape = (33,20)))

That seems to have helped, since I no longer get the reshape error, but I now get the following error:

ValueError: could not convert string to float: 'Alan\xa0Foster'

It now seems to be including the string columns in this calculation. I tried dropping all the string columns, but then got the error:

IndexError: tuple index out of range
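A string such as 'Alan\xa0Foster' reaching the model suggests the whole frame, name column included, was converted to an array. A minimal sketch of keeping only numeric columns before that conversion, using a tiny made-up frame (the `Name` column and all values are hypothetical; only `Year`, `HR` and `Nxt_BA` appear in the code above):

```python
import pandas as pd

# Tiny hypothetical frame standing in for the batting table.
df = pd.DataFrame({
    "Name": ["Alan\xa0Foster", "Some\xa0Player"],
    "Year": [1970, 1971],
    "HR": [0.1, 0.4],
    "Nxt_BA": [0.25, 0.31],
})

# Keep only numeric columns, then split features from the target.
numeric = df.select_dtypes(include="number")
X = numeric.drop(columns=["Nxt_BA"]).to_numpy(dtype=float)
y = numeric["Nxt_BA"].to_numpy(dtype=float)

print(list(numeric.columns), X.shape, y.shape)
```

Selecting just the predictor list (for example `data[predictors]`) before calling `.values` or `.to_numpy()` should have a similar effect, assuming every predictor column is numeric.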

I'm not sure how to get past all these errors.
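To make the shape question concrete: Keras LSTM layers expect input of shape (samples, timesteps, features). The sketch below shows one possible way to build such an array from a long-format table with one row per player-season. It assumes a player identifier column, here called `Player`, which is a hypothetical name; the helper `make_windows` and the toy data are likewise illustrative.

```python
import numpy as np
import pandas as pd


def make_windows(df, features, target, timesteps,
                 player_col="Player", year_col="Year"):
    """Build (samples, timesteps, features) inputs and matching targets
    from a long-format table with one row per player-season."""
    X, y = [], []
    for _, grp in df.sort_values(year_col).groupby(player_col):
        values = grp[features].to_numpy(dtype=float)
        targets = grp[target].to_numpy(dtype=float)
        # Each window covers `timesteps` successive rows for one player;
        # the label is the target on the window's last row.
        for end in range(timesteps, len(grp) + 1):
            X.append(values[end - timesteps:end])
            y.append(targets[end - 1])
    if not X:
        return np.empty((0, timesteps, len(features))), np.empty((0,))
    return np.stack(X), np.array(y)


# Hypothetical toy data: player "a" has 4 seasons, player "b" has 2.
toy = pd.DataFrame({
    "Player": ["a"] * 4 + ["b"] * 2,
    "Year": [1970, 1971, 1972, 1973, 1970, 1971],
    "HR": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "BB": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "Nxt_BA": [0.25, 0.26, 0.27, 0.28, 0.30, 0.31],
})
X, y = make_windows(toy, ["HR", "BB"], "Nxt_BA", timesteps=2)
print(X.shape, y.shape)
```

With an array like this, the first LSTM layer's `input_shape` would be `(timesteps, len(features))`; the samples dimension is left out of `input_shape`, but the data passed to `fit` still has to be 3D.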

python lstm reshape recurrent-neural-network multivariate-time-series