I am trying to use LightGBM regression to predict a variable (Y). However, all of my predicted values are identical (i.e. constant). Can anyone help me find the problem?
```python
import lightgbm as lgm
import pandas as pd
from sklearn.model_selection import train_test_split

data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]
df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])

data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]
df_y = pd.DataFrame(data_y, columns=['Value'])

X_df_earn_ind_fin_train, X_df_earn_ind_fin_test, y_df_earn_ind_fin_train, y_df_earn_ind_fin_test = train_test_split(df_x, df_y, test_size=0.3, random_state=21)

hyper_params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': ['mape', 'auc'],
    'learning_rate': 0.01,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.7,
    'bagging_freq': 10,
    'verbose': 0,
    'verbose_eval': -1,
    "max_depth": 10,
    "num_leaves": 96,
    "max_bin": 256,
    "num_iterations": 1000,
    "n_estimators": 250
}

gbm = lgm.LGBMRegressor(**hyper_params)
gbm.fit(X_df_earn_ind_fin_train, y_df_earn_ind_fin_train,
        eval_set=[(X_df_earn_ind_fin_test, y_df_earn_ind_fin_test)],
        eval_metric='mape')
y_pred_df_earn_ind_test = gbm.predict(X_df_earn_ind_fin_test)
```
But my output is just an array of a single constant value:
```python
y_pred_df_earn_ind_test =
array([1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863, 1497.21170863, 1497.21170863,
       1497.21170863, 1497.21170863])
```
How can I fix this?
When the training data has fewer than 200 rows, use the following parameters:

```python
min_data_in_leaf = 1
min_data_in_bin = 1
```
LightGBM has several important parameters for preventing overfitting, and their default values assume you have at least a few hundred samples.

- `min_data_in_leaf`: the minimum number of samples that must fall into a leaf node (default = 20)
- `min_data_in_bin`: the minimum number of samples combined into a single histogram "bin" when LightGBM discretizes features (default = 3)

For more details, see "Why is the R2 score zero in LightGBM?" and "Why does this simple LightGBM classifier perform poorly?".

For a very small dataset like the one in this example (41 training rows, 3 columns), those defaults can be very restrictive, causing each tree to add only a few splits.
Consider the following example using the data you provided (run with Python 3.11, `lightgbm==4.3.0`, `pandas==2.2.1`, and `scikit-learn==1.4.1`).
```python
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import train_test_split

data_x = [[2021,5,368.92],[2023,11,356.82],[2022,10,352.49],[2023,5,343.63],[2023,10,324.91],[2022,12,352.02],[2021,6,370.79],[2022,5,386.59],[2019,2,301.56],[2021,4,353.7],[2021,1,303.93],[2021,9,371.94],[2019,4,310.77],[2021,3,345.3],[2020,5,249.63],[2022,4,381.16],[2023,4,363.14],[2019,7,304.19],[2020,7,258.43],[2022,2,412.47],[2022,8,353.43],[2019,6,302.34],[2020,1,319.88],[2022,7,361.66],[2020,9,265.39],[2022,3,408.72],[2022,1,417.47],[2022,6,351.92],[2022,9,344.06],[2022,11,373.75],[2019,9,314.97],[2019,11,324.14],[2023,2,377.23],[2021,11,380.83],[2021,12,403.12],[2023,7,368.73],[2023,1,379.76],[2019,5,295.02],[2023,9,343.78],[2020,4,248.54],[2019,10,314.79],[2019,8,295.92],[2023,3,354.09],[2023,6,357.35],[2021,2,324.31],[2020,3,246.26],[2019,3,295.36],[2020,12,306.27],[2021,8,376.54],[2020,6,258.21],[2023,8,352.35],[2021,7,370.21],[2020,10,259.13],[2020,8,275.66],[2019,12,315.47],[2020,11,301.27],[2021,10,389.23],[2019,1,291.94],[2020,2,302.38]]
df_x = pd.DataFrame(data_x, columns=['Year', 'Month', 'Close'])

data_y = [[1479.42],[1654.53],[1537.76],[1621.22],[1567.62],[1528.39],[1444.63],[1562.17],[1356.81],[1463.48],[1558.9],[1463.96],[1362.03],[1432.7],[1502.46],[1524.71],[1592.68],[1342.74],[1467.48],[1553.66],[1609.19],[1349.1],[1379.39],[1496.12],[1448.08],[1562.96],[1525.25],[1575.06],[1591.15],[1544.66],[1319.9],[1366.73],[1482.72],[1520.73],[1557.03],[1577.37],[1624.74],[1402.05],[1614.94],[1482.28],[1338.88],[1354.6],[1553.65],[1606.36],[1510.78],[1348.05],[1323.39],[1542.95],[1411.64],[1493.44],[1563.53],[1414.8],[1452.67],[1491.7],[1451.43],[1467.23],[1477.13],[1360.29],[1386.48]]
df_y = pd.DataFrame(data_y, columns=['Value'])

X_train, X_test, y_train, y_test = train_test_split(
    df_x,
    df_y,
    test_size=0.3,
    random_state=21
)

params = {
    "num_iterations": 10,
    "objective": "regression",
    "min_data_in_leaf": 1,
    "min_data_in_bin": 1,
    "verbose": 0,
}

# train
gbm = lgb.LGBMRegressor(**params)
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_metric='mape')

# predict
preds = gbm.predict(X_test)
print(preds)
```
This produces predictions with some variation:
```python
[1514.86588126 1557.1389268  1423.54076682 1514.86588126 1488.24836945
 1541.52116271 1555.63537413 1393.69927646 1404.48244093 1465.1569698
 1404.48244093 1404.48244093 1514.86588126 1440.95713788 1535.84165832
 1482.58308126 1471.96999117 1504.50006758]
```
and the following scores on the test set:

```python
from sklearn.metrics import mean_absolute_error, r2_score

mean_absolute_error(y_test, preds)
# 45.212

r2_score(y_test, preds)
# 0.47
```
Some other notes related to the original question:

- `num_iterations` and `n_estimators` are aliases of each other... they mean exactly the same thing. Just use one of them. (LightGBM docs)
- `"auc"` is a classification metric... it is not appropriate for a regression problem. (LightGBM docs)
- `"task"` only applies to the LightGBM CLI. It has no effect on the Python package at all. Omit it. (LightGBM docs)
- With the `scikit-learn` estimators, omit `metric` from `params` and pass only the `eval_metric` keyword argument to `.fit()`.