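Let's say I have a dataset containing a timestamp (a non-standard timestamp column without datetime format) as the single feature and count as the label/target to predict, in the following pandas dataframe format: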
X y
Timestamp label
+--------+-----+
|TS_24hrs|count|
+--------+-----+
|0 |157 |
|1 |334 |
|2 |176 |
|3 |86 |
|4 |89 |
... ...
|270 |192 |
|271 |196 |
|272 |251 |
|273 |138 |
+--------+-----+
274 rows × 2 columns
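After splitting the 274 records with the following strategy, I implemented RF regression within an sklearn Pipeline():
- Split the data into [training-set + validation-set] (Ref.), e.g. the first 200 records [160 + 40]
- Keep an unseen [test-set] as hold-out for the final forecast, e.g. the last 74 records (after the 200th row/event)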
#print(train.shape) #(160, 2)
#print(validation.shape) #(40, 2)
#print(test.shape) #(74, 2)
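The split described above can be sketched as follows. This is a minimal illustration, assuming a synthetic stand-in for the real CSV: the random count values and the names n_train and n_val are hypothetical and only mimic the 274 rows × 2 columns layout.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (hypothetical values, same layout: 274 rows x 2 columns)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "TS_24hrs": np.arange(274),
    "count": rng.integers(50, 350, size=274),
})

# Chronological split without shuffling: 160 train, 40 validation, 74 test
n_train, n_val = 160, 40
train = df.iloc[:n_train]
validation = df.iloc[n_train:n_train + n_val]
test = df.iloc[n_train + n_val:]

print(train.shape, validation.shape, test.shape)
```

Because the rows are not shuffled, every validation and test timestamp is larger than any timestamp in the training set.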
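I tried the default pipeline as well as an optimized pipeline, tuning the hyperparameters by equipping the RF pipeline with GridSearchCV() to get the best results, but the results did not improve, as shown below: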
from sklearn.metrics import r2_score
print(f"r2 (defaults): {r2_score(test['count'], rf_pipeline2.predict(X_test))}")
print(f"r2 (opt.): {r2_score(test['count'], rf_pipeline2o.predict(X_test))}")
#r2 (defaults): 0.025314471951056405
#r2 (opt.): 0.07593841572721849
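For comparison, a common way to tune such a pipeline is to pass the whole Pipeline to GridSearchCV, so that the scaler is re-fit inside every fold. The sketch below assumes synthetic data and a reduced grid; the name search and the choice of scoring="neg_mean_absolute_error" are illustrative assumptions, not part of my setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 160 training rows (hypothetical values)
rng = np.random.default_rng(0)
X_train = np.arange(160).reshape(-1, 1)
y_train = rng.integers(50, 350, size=160)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("RF", RandomForestRegressor(random_state=10)),
])

# Step parameters are addressed as '<step name>__<parameter>'
param_grid = {
    "RF__n_estimators": [10, 50],
    "RF__max_depth": [1, 5, 10],
}

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=TimeSeriesSplit(n_splits=5),      # folds respect temporal order
    scoring="neg_mean_absolute_error",   # assumed metric for this sketch
    refit=True,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)
```

With a single feature, max_features is left out of the grid on purpose: a fractional value is converted to at least one feature, so every setting would behave the same.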
Full code to reproduce the example:
# Load the time-series data as dataframe
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/content/U2996_24hrs_.csv', sep=",")
# The first 200 records slice for training-set and validation-set
df200 = df[:200]
# The remaining records = 74 events (after the 200th event), kept as a hold-out unseen set for forecasting
test = df[200:] #test (keep it unseen)
# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X = df200[['TS_24hrs']]
y = df200['count']
X_train, X_val, y_train, y_val = train_test_split(X, y , test_size=0.2, shuffle=False, random_state=0) #train + validat
X_test = test['count'].values.reshape(-1,1)
# Train and fit the RF model
from sklearn.ensemble import RandomForestRegressor
#rf_model = RandomForestRegressor(random_state=10).fit(train, train['count']) #X, y
# build an end-to-end pipeline, and supply the data into a regression model and train within pipeline. It avoids leaking the test\val-set into the train-set
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline, make_pipeline
# Pipeline (defaults)
rf_pipeline2 = Pipeline([('scaler', MinMaxScaler()),('RF', RandomForestRegressor(random_state=10))]).fit(X_train,y_train) #Approach 2 train-set excludes label
# Pipeline (optimum)
# Parameters of pipelines can be set using '__' separated parameter names:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits = 5)
param_grid = {
"RF__n_estimators": [10, 50, 100],
"RF__max_depth": [1, 5, 10, 25],
"RF__max_features": [*np.arange(0.1, 1.1, 0.1)],}
rf_pipeline2o = Pipeline([('scaler', MinMaxScaler()),('RF', GridSearchCV(rf_pipeline2,
param_grid=param_grid,
n_jobs=2,
cv=tscv,
refit=True))]).fit(X_train,y_train) #Approach 2 train-set excludes label
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")
#print(rf_pipeline2)
#print(rf_pipeline2o)
# Use the pipeline to predict over the validation-set and test-set
y_predictions_test2 = rf_pipeline2.predict(X_test)
y_predictions_test2o = rf_pipeline2o.predict(X_test)
y_predictions_val2 = rf_pipeline2.predict(X_val)
y_predictions_val2o = rf_pipeline2o.predict(X_val)
# Convert the forecast over the test-set into a dataframe for easier plotting
df_pre_test_rf2 = pd.DataFrame({'TS_24hrs':test['TS_24hrs'], 'count_forecast_test':y_predictions_test2})
df_pre_test_rf2o = pd.DataFrame({'TS_24hrs':test['TS_24hrs'], 'count_forecast_test':y_predictions_test2o})
# Convert the predictions over the validation-set into a dataframe for easier plotting
df_pre_val_rf2 = pd.DataFrame({'TS_24hrs':X_val['TS_24hrs'], 'count_prediction_val':y_predictions_val2})
df_pre_val_rf2o = pd.DataFrame({'TS_24hrs':X_val['TS_24hrs'], 'count_prediction_val':y_predictions_val2o})
# evaluate performance with MAE
# Evaluate performance by calculate the loss and metric over unseen test-set
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error, explained_variance_score, r2_score
rf_mae_test2 = mean_absolute_error(test['count'], df_pre_test_rf2['count_forecast_test'])
rf_mae_test2o = mean_absolute_error(test['count'], df_pre_test_rf2o['count_forecast_test'])
# Visualize the forecast or prediction of the RF pipeline
import matplotlib.pyplot as plt
fig, ax = plt.subplots( figsize=(10,4))
pd.Series(y_train).plot(label='Training-set', c='b')
pd.Series(y_val).plot(label='Validation-set', linestyle=':', c='b')
test['count'].plot(label='Test-set (unseen)', c='cyan')
#predict plot over validation-set
df_pre_val_rf2['count_prediction_val'].plot(label=f'RF_predict_val (defaults) ', linestyle='--', c='green', marker="+")
df_pre_val_rf2o['count_prediction_val'].plot(label=f'RF_predict_val (opt.) ', linestyle='--', c='purple', marker="+", alpha= 0.4)
#forecast plot over test-set (unseen)
df_pre_test_rf2['count_forecast_test'].plot(label=f'RF_forecast_test (defaults) MAE={rf_mae_test2:.2f}', linestyle='--', c='green', marker="*")
df_pre_test_rf2o['count_forecast_test'].plot(label=f'RF_forecast_test (opt.) MAE={rf_mae_test2o:.2f}', linestyle='--', c='purple', marker="*", alpha= 0.4)
plt.legend()
plt.title('Plot of comparison results of the implemented approaches using the trained RF pipeline')
plt.ylabel('count', fontsize=15)
plt.xlabel('Timestamp [24hrs]', fontsize=15)
plt.show()
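I have implemented different approaches, but so far I have not figured out how to debug the problem. In the past, I had some issues with another regressor that output constant predictions, which I solved by hyperparameter tuning as in this post.
Furthermore, based on this answer:
"...Random Forests for regression / regression trees do not produce the expected predictions for data points beyond the range of the training data, because they cannot extrapolate (well)."
Although this could explain the constant predictions for the out-of-sample forecast over the (unseen) test-set, I still believe that even in this case it should show non-constant predictions over the validation-set, whereas in my case they are constant.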
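One way to check the quoted claim against the validation set is a small experiment. Note that with shuffle=False the validation timestamps (160 to 199) also lie beyond the training range (0 to 159), so the same limitation may apply there. The sketch assumes synthetic count values; pred_in and pred_val are illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: timestamp index as the only feature, random counts as target
rng = np.random.default_rng(0)
X = np.arange(200).reshape(-1, 1)
y = rng.integers(50, 350, size=200)

# Chronological split: rows 0..159 for training, rows 160..199 for validation
X_train, y_train = X[:160], y[:160]
X_val = X[160:]

rf = RandomForestRegressor(n_estimators=50, random_state=10).fit(X_train, y_train)

pred_in = rf.predict(X_train)   # timestamps inside the training range
pred_val = rf.predict(X_val)    # timestamps beyond the training range

# Spread (max - min) of the predictions in each region
print("spread inside training range:", np.ptp(pred_in))
print("spread on validation range:  ", np.ptp(pred_val))
```

Every split threshold in every tree is learned from training timestamps, so all later timestamps fall into the same right-most leaf of each tree and receive the same averaged value.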
Some related posts: