What should I do about overfitting in a tabular data model?

Question

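I've built a predictive model that predicts an outcome from certain features in the supplied data.

The model is a fastai tabular learner.

The dataset contains about 300 records, split into training, validation, and test sets.

I've already applied techniques against overfitting, such as early stopping and weight decay, but the model still appears to overfit when evaluated on unseen data.

I've also tried tuning hyperparameters such as the learning rate and batch size, with no improvement. I suspect that some aspect of my model architecture or preprocessing pipeline is contributing to the problem, but I'm not sure where to start investigating.

Due to the sensitive nature of the project, I can't share specifics about the dataset or the prediction task, but I can share the preprocessing and the structure of the current model.

Here is the training output: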

epoch	train_loss	valid_loss	accuracy	time
0	0.752707	0.579501	0.776119	00:00
1	0.699270	0.833771	0.776119	00:00
2	0.652438	0.598243	0.791045	00:00
3	0.621083	3.889398	0.776119	00:00
4	0.591348	0.632366	0.791045	00:00
5	0.580582	6.670314	0.791045	00:00

No improvement since epoch 2: early stopping

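Here is the preprocessing code (it runs after I have engineered the features, which I can't disclose). The features list defines each feature, including its valid value range and weight (feature, range_, and weight, as used in the normalization function below).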

def custom_normalize(df, feature, range_, weight):
    df[feature] = normalize(df[feature], range_)
    df[feature] = df[feature] * weight
    return df

splits = RandomSplitter(valid_pct=0.2)(range_of(df))

procs = [Categorify, FillMissing]

for feature, info in features.items():
    # Determine a range within which to select values when training.
    procs.append(partial(custom_normalize, feature=feature, range_=info['range'], weight=info['weight']))

As far as I know, building and training the model is fairly standard:

to = TabularPandas(df, procs=procs,
                   cat_names = cat_vars,
                   cont_names = cont_vars,
                   y_names=dep_var,
                   splits=splits)

dls = to.dataloaders(bs=64)

early_stop = EarlyStoppingCallback(monitor='accuracy', min_delta=0.01, patience=3)

learn = tabular_learner(dls, metrics=accuracy, wd=0.1)
learn.lr_find()

# Plot learning rate.
learn.recorder.plot_lr_find()

# Choose a learning rate based on the plot.
lr = learn.recorder.lrs[np.argmin(learn.recorder.losses)]

learn.fit_one_cycle(15, lr, cbs=early_stop)
learn.show_results()

# Only save model if none exists
# TODO wrap save in conditional that prevents saving if a model exists.
if not os.path.exists(model_fname):
    learn.save(model_fname)

Answer 1

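One thing that helped me a lot, and that you may be overlooking, is a Pearson correlation check. It gives you a report (I prefer a matrix) that shows whether any of your features are correlated with each other. Highly correlated features can strongly affect your model and make it prone to overfitting, so redundant correlated features should be removed. Although I have a lot of experience with Python, I feel more comfortable in R when it comes to machine/deep learning. I'll try to provide both versions.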

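The code looks like this:

In R:

# Pearson correlation coefficient
# Cherry-pick only numeric data and create a correlation matrix
numeric_data <- data[, sapply(data, is.numeric)]
correlation_matrix <- cor(numeric_data, method = "pearson")
print(correlation_matrix)

And in Python: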

import pandas as pd
import numpy as np

# Assuming 'data' is your DataFrame containing the dataset
# Cherry-pick only numeric data and do a correlation matrix
numeric_data = data.select_dtypes(include=[np.number])  # Selects only numeric columns

# Calculate the Pearson correlation matrix
correlation_matrix = numeric_data.corr(method='pearson')

# Print the correlation matrix
print(correlation_matrix)
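
Printing the matrix only shows the correlations; removing the redundant features is a separate step. The following is a minimal sketch of one way to do that, not a definitive recipe. The helper name `find_correlated_columns`, the 0.9 cutoff, and the toy DataFrame are all assumptions made for illustration, so pick a threshold that suits your own data.

```python
import numpy as np
import pandas as pd


def find_correlated_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Return numeric columns whose absolute Pearson correlation with an
    earlier column exceeds `threshold` (the earlier column is kept)."""
    corr = df.select_dtypes(include=[np.number]).corr(method="pearson").abs()
    # Keep only the strict upper triangle so each pair is examined once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (hypothetical): "b" is an exact linear function of "a",
# while "c" is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})

to_drop = find_correlated_columns(data, threshold=0.9)
reduced = data.drop(columns=to_drop)
print(to_drop)
```

In each correlated pair this keeps the column that appears first, which is an arbitrary choice; you may prefer to keep whichever feature is easier to interpret or collect.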

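Another pitfall may be feature scaling. Depending on your data, you may need to scale any numeric features. Here is an example using min-max scaling; the right type of scaling usually depends on your data: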

In R:

# Doing Min-Max scaling
library(scales)  # provides rescale()
scaled_features <- as.data.frame(lapply(numerical_features, rescale))

Python code (using pandas and scikit-learn):

from sklearn.preprocessing import MinMaxScaler
scaled_features = pd.DataFrame(MinMaxScaler().fit_transform(numerical_features), columns=numerical_features.columns)
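
One detail the snippets above leave open is which rows the scaler is fitted on. Below is a minimal sketch, assuming you want the minimum and maximum learned from the training rows only and then reused on the validation rows, so that no information from held-out data leaks into preprocessing. The helper names `fit_min_max` and `apply_min_max` and the toy values are hypothetical.

```python
import pandas as pd


def fit_min_max(train: pd.DataFrame) -> tuple:
    """Learn the per-column minimum and range from the training rows only."""
    col_min = train.min()
    # A constant column has range 0; use 1 instead to avoid dividing by zero.
    col_range = (train.max() - col_min).replace(0, 1)
    return col_min, col_range


def apply_min_max(df: pd.DataFrame, col_min: pd.Series, col_range: pd.Series) -> pd.DataFrame:
    """Scale with the training statistics: (x - min) / (max - min)."""
    return (df - col_min) / col_range


# Toy data (hypothetical values, for illustration only).
train = pd.DataFrame({"x": [0.0, 5.0, 10.0], "y": [100.0, 150.0, 200.0]})
valid = pd.DataFrame({"x": [2.5, 12.0], "y": [125.0, 100.0]})

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
valid_scaled = apply_min_max(valid, col_min, col_range)
print(valid_scaled)
```

Validation values that fall outside the training range end up outside [0, 1] (here `x = 12.0` scales to 1.2). That is expected with this approach and is usually preferable to refitting the scaler on held-out rows.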

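Could you try this and let me know how it goes? In general, I believe everything in ML/DL comes down to feature engineering and preprocessing. Let the data speak and guide you.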
