How do I forecast a binary time series or sequence?

Question

I'm currently facing a forecasting problem. I may be framing it the wrong way, but I need a different perspective.

I have a time series in a pandas df like this:

Date (index)	targetVar	feature1	feature2	feature m
20230204	35	xxx	xxx	xxx
20230205	34.2	xxx	xxx	xxx
20230206	36.5	xxx	xxx	xxx

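Before forecasting, since I only need to know whether the target variable will reach a certain threshold within a certain time horizon (which I call the "rolling period", i.e. a number of records ahead in the df), I create newTargetVar, which is 1 if targetVar > threshold occurs within the rolling period and 0 otherwise.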
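
The labeling step described above could be sketched as follows. This is only a minimal illustration, assuming the horizon covers the next `rolling_period` rows and excludes the current one; `make_forward_label` is a hypothetical helper name and the sample values are made up.

```python
import numpy as np
import pandas as pd

def make_forward_label(target: pd.Series, threshold: float, rolling_period: int) -> pd.Series:
    """1.0 if target exceeds threshold in any of the next `rolling_period` rows,
    0.0 otherwise, and NaN where the forward window is incomplete."""
    # rolling(k).max() at row s covers rows s-k+1..s; shifting by -k moves that
    # value to row t = s-k, so it describes rows t+1..t+k (strictly in the future).
    fwd_max = target.rolling(window=rolling_period).max().shift(-rolling_period)
    # Keep NaN for the trailing rows whose outcome is not known yet.
    return (fwd_max > threshold).astype(float).where(fwd_max.notna())

# Made-up values, for illustration only
target = pd.Series([30.0, 36.0, 31.0, 32.0, 37.0, 30.0])
labels = make_forward_label(target, threshold=35, rolling_period=2)
```

Leaving the last `rolling_period` labels as NaN keeps the rows with a still-unknown outcome explicit, so they can be excluded from training instead of being imputed.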

So we have a df like this:

Date (index)	newTargetVar	feature1	feature2	feature m
20230204	1	xxx	xxx	xxx
20230205	1	xxx	xxx	xxx
20230206	0	xxx	xxx	xxx

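Now we have a sequence of 0s and 1s to forecast. At the end of the df, the missing 0s and 1s in newTargetVar are imputed with some reasonable value. The idea is to obtain the probability of reaching the target as of the last date, and to make a decision based on it.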

The variables affect the response differently over time, so the model should give less weight to older observations and more weight to recent ones.
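
One way to express this preference is with per-sample weights that decay with age. The sketch below is an assumption-laden illustration, not a method from this thread: `recency_weights` and `half_life` are hypothetical names, and exponential decay is just one possible choice.

```python
import numpy as np

def recency_weights(n_samples: int, half_life: float) -> np.ndarray:
    """Weights for samples ordered oldest to newest.

    The newest sample gets weight 1.0, and the weight halves for every
    `half_life` rows further back in time."""
    age = np.arange(n_samples - 1, -1, -1)  # newest row has age 0
    return 0.5 ** (age / half_life)

weights = recency_weights(5, half_life=2)
```

Many scikit-learn classifiers accept such an array through the `sample_weight` argument of `fit`, which is one place these weights could be used.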

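I tried a sliding-window approach: take a window of a given size up to time (record) t, train a classifier, and predict at time t+1. After iterating over the data this way, I get some metrics, such as AUC on the test data.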
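
A walk-forward loop of this kind might look like the sketch below. It assumes scikit-learn is available and uses `LogisticRegression` purely as a stand-in classifier on synthetic data. The `gap` parameter is an added assumption: it skips the most recent rows, whose labels still depend on future values of the target, which is one way to limit look-ahead. `walk_forward_probs` is a hypothetical helper name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def walk_forward_probs(X, y, window, gap):
    """Train on a sliding window, skip `gap` rows, then predict P(y=1) for row t."""
    probs, truth = [], []
    for t in range(window + gap, len(X)):
        train_end = t - gap  # training rows are [train_end - window, train_end)
        X_tr = X[train_end - window:train_end]
        y_tr = y[train_end - window:train_end]
        if len(np.unique(y_tr)) < 2:
            continue  # the classifier needs both classes in the window
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        probs.append(clf.predict_proba(X[t:t + 1])[0, 1])  # column 1 is class 1
        truth.append(y[t])
    return np.array(probs), np.array(truth)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

probs, truth = walk_forward_probs(X, y, window=50, gap=5)
auc = roc_auc_score(truth, probs)
```

Setting `gap` to at least the rolling period is a conservative choice, since a label at row s only becomes fully known once the rolling period after s has elapsed.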

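I know this approach suffers from look-ahead bias, and I verified that the bias is stronger than expected. Also, I genuinely believe that the 0s and 1s do not occur at random but under certain conditions, and that it is very unlikely for the outcome to change much between time t and t+1, so the method should take this persistence into account. But if I add newTargetVar at time t-1 or t-2 to the features, the bias gets even worse.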

I hope everything is clear.

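Is there something fundamentally wrong with how I have defined my approach? Do you know how to obtain the probability I want to estimate? Any models or suggestions? How would you tackle this problem?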

Answer 1

Try using time-decay feature engineering and a GRU model for the binary time series forecasting problem.

Here I try to create data similar to what you might have:

import pandas as pd
import numpy as np

# Generate a sample dataframe
np.random.seed(42) # For reproducibility
dates = pd.date_range('20230204', periods=60) # 60 days of data
data = {
    'targetVar': np.random.uniform(30, 40, size=60), # Random values between 30 and 40
    'feature1': np.random.uniform(0, 1, size=60),
    'feature2': np.random.uniform(-1, 1, size=60),
    # Add more features as needed
}
df = pd.DataFrame(data, index=dates)

# Define a threshold and rolling period
threshold = 35
rolling_period = 5

# Generate 'newTargetVar' based on the rolling maximum of 'targetVar' exceeding the threshold
df['newTargetVar'] = (df['targetVar'].rolling(window=rolling_period).max() > threshold).astype(int).shift(-rolling_period + 1).fillna(0)

Then apply time-decay feature engineering:

# Applying a simple linear time decay to a rolling window of features
def apply_time_decay(df, decay_window):
    # Assuming df is your DataFrame with features and 'Date' as the index
    # Rolling windows are ordered oldest to newest, so the weights rise
    # from 0.5 (oldest) to 1 (most recent)
    weights = np.linspace(0.5, 1, decay_window)
    decayed_features = pd.DataFrame(index=df.index)
    for column in df.columns:
        if column not in ['newTargetVar', 'targetVar']:  # Exclude target variables
            # Weighted average of the window, normalized by the sum of the weights
            decayed_column = df[column].rolling(window=decay_window).apply(lambda x: np.dot(x, weights) / weights.sum(), raw=True)
            decayed_features[column + '_decayed'] = decayed_column
    return decayed_features

decayed_df = apply_time_decay(df[['feature1', 'feature2']], 5)  # Example with a 5-day decay window
df = df.join(decayed_df).dropna()  # Drop the first rows, where the rolling window is incomplete (NaN)
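
As a possible alternative to a hand-written linear decay, pandas offers exponentially weighted means through `ewm`. The sketch below swaps in that technique on made-up data; the `halflife` value and the `_ewm` suffix are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up features, for illustration only
rng = np.random.default_rng(0)
features = pd.DataFrame(
    {
        'feature1': rng.uniform(0, 1, size=20),
        'feature2': rng.uniform(-1, 1, size=20),
    },
    index=pd.date_range('2023-02-04', periods=20),
)

# halflife=3: an observation 3 rows old counts half as much as the newest one.
# Each value uses only the current and earlier rows, so nothing leaks from the future.
ewm_features = features.ewm(halflife=3, adjust=True).mean().add_suffix('_ewm')
```

With the default `min_periods`, this produces no leading NaN rows, so no rows need to be dropped before building sequences.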

Then prepare the sequences for the GRU model:

def create_sequences(df, target_column, n_steps):
    X, y = [], []
    for i in range(len(df)):
        end_ix = i + n_steps
        if end_ix > len(df):
            break
        seq_x, seq_y = df.iloc[i:end_ix, :].drop(columns=[target_column]).values, df.iloc[end_ix-1][target_column]
        X.append(seq_x)
        y.append(seq_y)
    return np.array(X), np.array(y)

n_steps = 10  # Number of time steps in each sequence
features_columns = [col for col in df.columns if 'decayed' in col]  # Use only decayed features for training
# df already contains 'newTargetVar'; joining it again would raise an overlapping-column error
X, y = create_sequences(df[features_columns + ['newTargetVar']], 'newTargetVar', n_steps)

Then build the model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import GRU, Dense
from tensorflow.keras.optimizers import Adam

# Building the GRU model
model = Sequential([
    GRU(128, input_shape=(n_steps, len(features_columns))),
    Dense(1, activation='sigmoid')
])

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Split the data into training and testing (simple split for this example, consider time-based splitting)
split_idx = int(len(X) * 0.8)
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

# Train the model
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
