What is the correct way to train a model on a hierarchical time series dataset? Multi-step forecasting strategies?

Question

Imagine a time series dataset in the following format.

import pandas as pd
import numpy as np

# Set the date range
date_range = pd.date_range('2022-01-01', periods=7, freq='D')

# Define the categories
categories = ['A', 'B']

# Create the hierarchical index
index = pd.MultiIndex.from_product([date_range, categories], names=['Date', 'Category'])

# Generate random numbers for each row
data = np.random.rand(len(index))

# Create the dataframe
df = pd.DataFrame(data, index=index, columns=['Value'])

# Add a lag 1 feature
df['Lag1'] = df.groupby('Category')['Value'].shift()

# Split the data into train and test sets
train = df.loc[:'2022-01-04']
test = df.loc['2022-01-05':]

The resulting dataset looks like this:

Train Set:
                        Value      Lag1
Date       Category                    
2022-01-01 A         0.016480       NaN
           B         0.186811       NaN
2022-01-02 A         0.668557  0.016480
           B         0.256664  0.186811
2022-01-03 A         0.552484  0.668557
           B         0.607732  0.256664
2022-01-04 A         0.869755  0.552484
           B         0.051533  0.607732

Test Set:
                        Value      Lag1
Date       Category                    
2022-01-05 A         0.036175  0.869755
           B         0.063466  0.051533
2022-01-06 A         0.078312  0.036175
           B         0.129991  0.063466
2022-01-07 A         0.280402  0.078312
           B         0.899824  0.129991

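I recently learned about multi-step forecasting and have a few questions. Suppose I want to use a random forest model:

After training the model on the training set (with TimeSeriesSplit cross-validation), do I predict on the test set directly? Or do I have to predict one step at a time: first the values for "2022-01-05" in the test set, then the next date, and so on?
I use lagged values along with other features. After training on the training set and predicting on the whole test set directly, the model usually assigns the lag features by far the highest importance, well ahead of all other features. Would this be considered a sign of data leakage?
Is multi-step forecasting only used for actually forecasting future values beyond the test set (in this example, dates after "2022-01-07")? Or is it also how the test data should be predicted?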
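
To make the two options in the first question concrete, here is a minimal sketch of what each would look like on the toy data above. It is an illustration under assumptions, not a recommended pipeline: the fixed seed, the `n_estimators=50` setting, the use of `Lag1` as the only feature, and the names `one_step`, `recursive`, and `prev` are all chosen here for the example. Rows with a missing lag are dropped before fitting.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Rebuild the toy panel from the question (the seed is an assumption, for repeatability).
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", periods=7, freq="D")
index = pd.MultiIndex.from_product([dates, ["A", "B"]], names=["Date", "Category"])
df = pd.DataFrame({"Value": rng.random(len(index))}, index=index)
df["Lag1"] = df.groupby("Category")["Value"].shift()

train = df.loc[:"2022-01-04"].dropna()  # first date has no lag, so drop it
test = df.loc["2022-01-05":]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(train[["Lag1"]], train["Value"])

# Option 1: predict the whole test set at once.
# Lag1 in the test set holds the TRUE previous value, so every prediction
# is effectively a one-step-ahead forecast.
one_step = pd.Series(model.predict(test[["Lag1"]]), index=test.index)

# Option 2: recursive multi-step forecast.
# Only the last training value per category is treated as known; each
# prediction is fed back in as the lag for the next date.
prev = train.groupby("Category")["Value"].last()
test_dates = test.index.get_level_values("Date").unique()
steps = []
for _ in test_dates:
    X_step = prev.to_frame("Lag1")  # same column name as used in fit
    prev = pd.Series(model.predict(X_step), index=prev.index)
    steps.append(prev)
recursive = pd.concat(steps, keys=test_dates, names=["Date", "Category"])

print(pd.DataFrame({"one_step": one_step, "recursive": recursive}))
```

The two columns agree on the first test date, because the lag for "2022-01-05" is the last training value in both cases, and can diverge afterwards, since the recursive version never sees the actual test values.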

Thank you for your help and insights.
