Debugging RandomForestRegressor() producing mostly constant predictions on time-series data


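Suppose I have a dataset with a timestamp (a non-standard timestamp column, not in datetime format) as the single feature and count as the label/target to predict, in the following dataframe format: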

   X        y
Timestamp label
+--------+-----+
|TS_24hrs|count|
+--------+-----+
|0       |157  |
|1       |334  |
|2       |176  |
|3       |86   |
|4       |89   |
 ...      ...
|270     |192  |
|271     |196  |
|272     |251  |
|273     |138  |
+--------+-----+
274 rows × 2 columns

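After splitting the 274 records with the following strategy, I implemented RF regression within a Pipeline():

  • Split the data into [training-set + validation-set], e.g. the first 200 records [160 + 40]
  • Keep the unseen [test-set] held out for the final forecast, e.g. the last 74 records (after the 200th event)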
#print(train.shape)          #(160, 2)
#print(validation.shape)     #(40, 2)
#print(test.shape)           #(74, 2)

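I tried the default pipeline as well as an optimized pipeline, in which GridSearchCV() tunes the hyperparameters of the RF pipeline, but the results did not improve, as shown below: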

from sklearn.metrics import r2_score

print(f"r2 (defaults): {r2_score(test['count'], rf_pipeline2.predict(X_test))}")
print(f"r2 (opt.):     {r2_score(test['count'], rf_pipeline2o.predict(X_test))}")

#r2 (defaults): 0.025314471951056405
#r2 (opt.):     0.07593841572721849


Full code to reproduce the example:

# Load the time-series data as dataframe
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('/content/U2996_24hrs_.csv', sep=",")

# The first 200 records slice for training-set and validation-set
df200 = df[:200]          

# The remaining 74 records (after the 200th event) are kept as a held-out, unseen set for forecasting
test = df[200:]   #test (keep it unseen)

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split

X = df200[['TS_24hrs']]
y = df200['count']
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size=0.2, shuffle=False, random_state=0)  #train + validat
X_test = test[['TS_24hrs']]   # use the timestamp feature, not the target, as the model input


# Train and fit the RF model
from sklearn.ensemble import RandomForestRegressor
#rf_model = RandomForestRegressor(random_state=10).fit(train, train['count']) #X, y

# Build an end-to-end pipeline and train the regression model inside it; this avoids leaking the test/validation set into the training set
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline

# Pipeline (defaults) 
rf_pipeline2  = Pipeline([('scaler', MinMaxScaler()),('RF', RandomForestRegressor(random_state=10))]).fit(X_train,y_train)   #Approach 2 train-set excludes label

# Pipeline (optimum)
# Parameters of pipelines can be set using '__' separated parameter names:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits = 5)
param_grid = {
    "RF__n_estimators": [10, 50, 100],
    "RF__max_depth":    [1, 5, 10, 25],
    "RF__max_features": [*np.arange(0.1, 1.1, 0.1)],}

rf_pipeline2o = Pipeline([
    ('scaler', MinMaxScaler()),
    ('RF', GridSearchCV(rf_pipeline2,
                        param_grid=param_grid,
                        n_jobs=2,
                        cv=tscv,
                        refit=True)),
]).fit(X_train, y_train)  #Approach 2 train-set excludes label

# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")

#print(rf_pipeline2) 
#print(rf_pipeline2o)

# Use the pipeline to predict over the validation-set and test-set

y_predictions_test2  = rf_pipeline2.predict(X_test)
y_predictions_test2o = rf_pipeline2o.predict(X_test)

y_predictions_val2   = rf_pipeline2.predict(X_val)
y_predictions_val2o  = rf_pipeline2o.predict(X_val)

# Convert the test-set forecasts into a dataframe for easier plotting
df_pre_test_rf2  = pd.DataFrame({'TS_24hrs':test['TS_24hrs'],        'count_forecast_test':y_predictions_test2})
df_pre_test_rf2o = pd.DataFrame({'TS_24hrs':test['TS_24hrs'],        'count_forecast_test':y_predictions_test2o})


# Convert the validation-set predictions into a dataframe for easier plotting
df_pre_val_rf2  = pd.DataFrame({'TS_24hrs':X_val['TS_24hrs'],        'count_prediction_val':y_predictions_val2})
df_pre_val_rf2o = pd.DataFrame({'TS_24hrs':X_val['TS_24hrs'],        'count_prediction_val':y_predictions_val2o})

# evaluate performance with MAE
# Evaluate performance by calculate the loss and metric over unseen test-set
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, explained_variance_score, r2_score 

rf_mae_test2  = mean_absolute_error(test['count'],                 df_pre_test_rf2['count_forecast_test'])
rf_mae_test2o = mean_absolute_error(test['count'],                 df_pre_test_rf2o['count_forecast_test'])

# Visualize the forecasts and predictions of the RF pipelines
import matplotlib.pyplot as plt
fig, ax = plt.subplots( figsize=(10,4))

pd.Series(y_train).plot(label='Training-set', c='b')
pd.Series(y_val).plot(label='Validation-set', linestyle=':', c='b')
test['count'].plot(label='Test-set (unseen)', c='cyan')

#predict plot over validation-set
df_pre_val_rf2['count_prediction_val'].plot(label=f'RF_predict_val (defaults)              ', linestyle='--', c='green',   marker="+")
df_pre_val_rf2o['count_prediction_val'].plot(label=f'RF_predict_val (opt.)       ', linestyle='--', c='purple',    marker="+", alpha= 0.4)

#forecast plot over test-set (unseen)
df_pre_test_rf2['count_forecast_test'].plot(label=f'RF_forecast_test (defaults)     MAE={rf_mae_test2:.2f}', linestyle='--', c='green',   marker="*")
df_pre_test_rf2o['count_forecast_test'].plot(label=f'RF_forecast_test (opt.)           MAE={rf_mae_test2o:.2f}', linestyle='--', c='purple',    marker="*", alpha= 0.4)

plt.legend()
plt.title('Comparison of results from the trained RF pipelines')
plt.ylabel('count', fontsize=15)
plt.xlabel('Timestamp [24hrs]', fontsize=15)
plt.show()

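I have tried different approaches, but so far I have not figured out how to debug the problem. In the past I ran into similar issues with another regressor that output constant predictions, and I solved them with hyperparameter tuning, as described in this post.

Moreover, according to this answer:

"...Random forests for regression / regression trees do not produce the expected predictions for data points beyond the range of the training data, because they cannot extrapolate (well)."

Although this could explain the constant predictions for the out-of-sample forecast on the (unseen) test set, I still believe that, even so, the model should produce non-constant predictions on the validation set. In my case, however, those are constant as well.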
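The extrapolation claim in the quote can be checked in isolation. The following is a minimal sketch on synthetic data, not the original CSV: the noisy upward trend and the names X_future and future_pred are assumptions made for illustration. One thing worth noting is that with shuffle=False the validation rows (timestamps 160 to 199) also lie beyond the training range of 0 to 159, so the same effect may apply to them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the real data (assumption): a noisy upward trend.
rng = np.random.default_rng(0)
X_train = np.arange(160).reshape(-1, 1)          # "timestamps" 0..159
y_train = 2.0 * X_train.ravel() + rng.normal(0, 5, size=160)

rf = RandomForestRegressor(n_estimators=50, random_state=10).fit(X_train, y_train)

# All split thresholds lie inside the training range, so every timestamp
# beyond it lands in the same right-most leaf of each tree.
X_future = np.arange(160, 274).reshape(-1, 1)    # validation + test timestamps
future_pred = rf.predict(X_future)

print(np.unique(future_pred).size)   # number of distinct forecast values
```

If the printed number is 1, the forest returns a single value for every timestamp past the end of the training data, regardless of hyperparameters.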



python machine-learning scikit-learn time-series random-forest
2 Answers
0
votes

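This is not surprising at all. Given that your only input is a sequence of consecutive numbers, all predictions on the test set will roughly correspond to the last value of the training set, because they all follow the same decision rules. You cannot do TS forecasting this way. Either use a model that handles TS data (ARIMA, LSTM), or transform the data so that the model's input is the previous K observations of the target.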
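The "previous K observations" transformation could look roughly like the sketch below. The helper name make_lag_features is hypothetical (it is not part of scikit-learn), and the sample values are the first five counts from the table in the question.

```python
import numpy as np

def make_lag_features(series, k):
    """Return (X, y) where each row of X holds the k values preceding the matching y."""
    values = np.asarray(series, dtype=float)
    if k < 1 or k >= len(values):
        raise ValueError("k must satisfy 1 <= k < len(series)")
    # Column j holds the value observed (k - j) steps before the target.
    X = np.column_stack([values[j:len(values) - k + j] for j in range(k)])
    y = values[k:]
    return X, y

counts = [157, 334, 176, 86, 89]          # first rows of the 'count' column
X_lag, y_lag = make_lag_features(counts, k=2)
print(X_lag)   # rows: [157, 334], [334, 176], [176, 86]
print(y_lag)   # targets: 176, 86, 89
```

X_lag and y_lag can then be passed to the regressor in place of the raw timestamp column. The first k rows are lost, and forecasting more than one step ahead requires feeding predictions back in as lags.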


0
votes

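You need to use a more suitable model, for example one that understands periodicity, such as a Gaussian process with the right kernel, or some other model. You could also consider a library that is better suited to time-series data.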
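As a rough illustration of the Gaussian-process suggestion, here is a minimal sketch using scikit-learn's GaussianProcessRegressor with a periodic ExpSineSquared kernel. The data is synthetic, and the period of 24 steps is an assumption fixed by hand; for the real series the period would have to be estimated or known from the domain.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

# Synthetic periodic series (assumption): period of 24 steps plus noise.
rng = np.random.default_rng(0)
t_train = np.arange(160, dtype=float).reshape(-1, 1)
y_train = 100 + 50 * np.sin(2 * np.pi * t_train.ravel() / 24) + rng.normal(0, 5, 160)

# Periodic kernel with the period held fixed, plus a noise term.
kernel = (ExpSineSquared(length_scale=1.0, periodicity=24.0, periodicity_bounds="fixed")
          + WhiteKernel(noise_level=1.0))
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(t_train, y_train)

# Forecast beyond the training range, with a per-point uncertainty estimate.
t_future = np.arange(160, 274, dtype=float).reshape(-1, 1)
mean, std = gp.predict(t_future, return_std=True)
print(mean[:5], std[:5])
```

Unlike the random forest, the periodic kernel lets the forecast keep oscillating past the last training timestamp, and the returned standard deviation gives a sense of how far to trust it.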
