Constant prediction values in LightGBM

Problem description

I am trying to predict a variable (Y) using LightGBM regression. However, my predicted values are all identical (i.e., constant). Can someone help me find the problem?

import lightgbm as lgm
import pandas as pd
from sklearn.model_selection import train_test_split

data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]

df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])

data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]

df_y = pd.DataFrame(data_y, columns=['Value'])

X_df_earn_ind_fin_train, X_df_earn_ind_fin_test, y_df_earn_ind_fin_train, y_df_earn_ind_fin_test = train_test_split(df_x, df_y, test_size=0.3, random_state=21)

hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['mape', 'auc'],
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    'verbose_eval': -1,
    "max_depth": 10,
    "num_leaves": 96,  
    "max_bin": 256,
    "num_iterations": 1000,
    "n_estimators": 250
}

gbm = lgm.LGBMRegressor(**hyper_params)
gbm.fit(X_df_earn_ind_fin_train, y_df_earn_ind_fin_train,
        eval_set=[(X_df_earn_ind_fin_test, y_df_earn_ind_fin_test)],
        eval_metric='mape')

y_pred_df_earn_ind_test = gbm.predict(X_df_earn_ind_fin_test)

But my output is just an array of a single constant value:

y_pred_df_earn_ind_test = 
array([1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863])

How can I fix this?

python python-3.x machine-learning lightgbm boosting
1 Answer

Short answer

When training on fewer than 200 rows of data, use the following parameters:

  • min_data_in_leaf = 1
  • min_data_in_bin = 1

Details

LightGBM has several important parameters for preventing overfitting, and their default values assume you have at least a few hundred samples.

  • min_data_in_leaf: the minimum number of samples that must fall into a leaf node (default = 20)
  • min_data_in_bin: the minimum number of samples combined into one histogram "bin" when LightGBM discretizes a feature (default = 3)

For more details, see "Why is the R2 score zero in LightGBM?" and "Why does this simple LightGBM classifier perform poorly?".

For a very small dataset like the one in this example (41 training rows, 3 columns), those defaults can be very restrictive, causing each tree to add only a few splits.

Consider the following example using the data you provided (run under Python 3.11, lightgbm==4.3.0, pandas==2.2.1, scikit-learn==1.4.1).

import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]

df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])

data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]

df_y = pd.DataFrame(data_y, columns=['Value'])

X_train, X_test, y_train, y_test = train_test_split(
    df_x,
    df_y,
    test_size=0.3,
    random_state=21
)

params = {
    "num_iterations": 10,
    "objective": "regression",
    "min_data_in_leaf": 1,
    "min_data_in_bin": 1,
    "verbose": 0,
}

# train
gbm = lgb.LGBMRegressor(**params)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='mape')

# predict
preds = gbm.predict(X_test)
print(preds)

This produces predictions that vary.

[1514.86588126 1557.1389268  1423.54076682 1514.86588126 1488.24836945
 1541.52116271 1555.63537413 1393.69927646 1404.48244093 1465.1569698
 1404.48244093 1404.48244093 1514.86588126 1440.95713788 1535.84165832
 1482.58308126 1471.96999117 1504.50006758]

And the following scores on the test set:

from sklearn.metrics import mean_absolute_error, r2_score

mean_absolute_error(y_test, preds)
# 45.212

r2_score(y_test, preds)
# 0.47

A few other notes related to the original question:

  • num_iterations and n_estimators are aliases of each other... they mean exactly the same thing. Use only one of them. (LightGBM docs)
  • "auc" is a classification metric... it is not suitable for a regression problem. (LightGBM docs)
  • task only applies to the LightGBM CLI. It has no effect at all in the Python package. Omit it. (LightGBM docs)
  • In LightGBM's scikit-learn estimators, omit metric from params and instead pass only the eval_metric keyword argument to .fit().