LightGBM - How many trees do I actually have?

Question

Beginner here, trying out LGBM. My code looks like this:

clf = lgb.LGBMClassifier(max_depth=3, verbosity=-1, n_estimators=3)
clf.fit(train_data[features], train_data['y'], sample_weight=train_data['weight'])
print (f"I have {clf.n_estimators_} estimators")
fig, ax = plt.subplots(nrows=4, figsize=(50,36), sharex=True)
lgb.plot_tree(clf, tree_index=7, dpi=600, ax=ax[0]) # why does it have 7th tree?
lgb.plot_tree(clf, tree_index=8, dpi=600, ax=ax[1]) # why does it have 8th tree?
#lgb.plot_tree(clf, tree_index=9, dpi=600, ax=ax[2]) # crashes
#lgb.plot_tree(clf, tree_index=10, dpi=600, ax=ax[3]) # crashes

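I'm surprised that, despite n_estimators=3, I seem to have 9 trees. How do I actually set the number of trees and, relatedly, what does n_estimators do? I've read the docs and thought it was the number of trees, but it seems to be something else.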

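Separately, how do I interpret the individual trees and their ordering 0, 1, 2, etc.? I know random forests, where every tree is equally important. In boosting, the first tree matters most, the next one much less, the one after that less still. So when I look at the tree diagrams, how can I "simulate" the LightGBM inference process in my head?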

Answer 1

"How do I actually set the number of trees and, relatedly, what does n_estimators do?"

Pass n_estimators or one of its aliases (see the LightGBM docs).

In LightGBM's scikit-learn interface (classes like LGBMClassifier), n_estimators controls the number of boosting rounds.

For all tasks other than multiclass classification, LightGBM produces 1 tree per boosting round.

For multiclass classification, LightGBM trains 1 tree per class in each boosting round.

So, for example, if your target has 5 classes, training with n_estimators=3 and no early stopping will produce 15 trees.
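As a quick check of that arithmetic, here is a minimal sketch in plain Python. The helper expected_num_trees is a hypothetical name used for illustration, not part of LightGBM, and it assumes no early stopping and LightGBM's built-in objectives. It also assumes the target in the question has 3 classes, which would explain seeing 9 trees with n_estimators=3.

```python
def expected_num_trees(n_estimators: int, num_classes: int) -> int:
    """Hypothetical helper: number of trees after training all rounds.

    Assumes no early stopping and LightGBM's built-in objectives:
    one tree per round for binary problems, one tree per class per
    round for multiclass problems (3 or more classes).
    """
    trees_per_round = num_classes if num_classes > 2 else 1
    return n_estimators * trees_per_round


# The 5-class example from the answer: 3 rounds x 5 classes.
print(expected_num_trees(n_estimators=3, num_classes=5))  # 15

# Assumed 3-class target from the question: 3 rounds x 3 classes.
n_trees = expected_num_trees(n_estimators=3, num_classes=3)
print(n_trees)  # 9

# Tree indices are 0-based, so the last valid tree_index is n_trees - 1.
print(list(range(n_trees)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Under those assumptions, tree_index=7 and tree_index=8 exist while 9 and 10 do not, which is consistent with the crashes in the question.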

"How do I interpret the individual trees and their ordering 0, 1, 2, etc.? ... How can I 'simulate' the LightGBM inference process?"

Each consecutive group of {num_classes} trees corresponds to one boosting round. Within a group, the trees are ordered by target class.

Given an input X, LightGBM's raw score for X belonging to class i is given by:

tree_{i}(X) +
tree_{i+num_classes}(X) +
tree_{i+num_classes*2}(X)
... etc., etc.

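For example, consider 5-class multiclass classification with LightGBM's built-in multiclass objective and 3 boosting rounds.

The LightGBM score for a sample x belonging to the first class is the sum of the corresponding leaf values from the 1st, 6th, and 11th trees (tree indices 0, 5, and 10).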
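The index layout described above can be sketched in plain Python. Both function names below are hypothetical helpers written for illustration (they are not LightGBM APIs), and they assume 0-based tree indices laid out round by round, ordered by class within each round.

```python
def tree_indices_for_class(class_index: int, num_classes: int, n_rounds: int) -> list[int]:
    """Hypothetical helper: 0-based indices of the trees that score one class."""
    return [class_index + r * num_classes for r in range(n_rounds)]


def round_and_class(tree_index: int, num_classes: int) -> tuple[int, int]:
    """Hypothetical helper: (boosting round, class) for a 0-based tree index."""
    return divmod(tree_index, num_classes)


# 5 classes, 3 rounds: the first class uses the 1st, 6th and 11th trees.
print(tree_indices_for_class(class_index=0, num_classes=5, n_rounds=3))  # [0, 5, 10]

# With 3 classes, tree 7 belongs to round 2 (0-based) and class 1.
print(round_and_class(tree_index=7, num_classes=3))  # (2, 1)
```

The second call is handy when reading a plot produced with lgb.plot_tree: it tells you which round and which class a given tree_index contributes to.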

Here's a minimal, reproducible example, using lightgbm==4.3.0 and Python 3.11.

import lightgbm as lgb
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# generate multiclass dataset with 5 classes
X, y = make_blobs(n_samples=1_000, centers=5, random_state=773)

# fit a small multiclass classification model
clf = lgb.LGBMClassifier(n_estimators=3, num_leaves=4, seed=708)
clf.fit(X, y)

# underlying model has 15 trees
clf.booster_.num_trees()
# 15

# but just 3 iterations (boosting rounds)
clf.n_iter_
# 3

# just plot the trees for the first class
lgb.plot_tree(clf, tree_index=0)
lgb.plot_tree(clf, tree_index=5)
lgb.plot_tree(clf, tree_index=10)
plt.show()

This produces one tree diagram for each of those three trees.

Now generate the raw prediction for the first row of the training data.

clf.predict(X, raw_score=True)[0,]
# array([-1.24209749, -1.90204682, -1.9020346 , -1.89711144, -1.23250193])

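You can work out by hand which leaf node that sample falls into in each tree, then add up those leaf values. The total should match the first item in the raw scores above (in this case, -1.24209749).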

If you have pandas available, you might find it easier to dump the tree structure to a data frame and work with it there.

model_df = clf.booster_.trees_to_dataframe()

# trees relevant to class 0
relevant_trees = [0, 5, 10]

# figure out which leaf index each sample falls into
leaf_preds = clf.predict(X, pred_leaf=True)[0,]

# subset that to only the trees relevant to class 0
relevant_leaf_ids = [
    f"0-L{leaf_preds[0]}",
    f"5-L{leaf_preds[5]}",
    f"10-L{leaf_preds[10]}"
]

# show the values LightGBM would predict from each tree
model_df[
  model_df["tree_index"].isin(relevant_trees) &
  model_df["node_index"].isin(relevant_leaf_ids)
][["tree_index", "node_index", "value"]]

    tree_index node_index     value
5            0       0-L3 -1.460720
38           5       5-L3  0.119902
73          10      10-L3  0.098720

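Those 3 values add up to -1.242098, which is almost identical to the score predicted by clf.predict(X, raw_score=True) (the difference is only numerical precision lost in printing).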
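To finish simulating inference by hand, the per-class raw scores still need to be turned into probabilities. A minimal sketch in plain Python, assuming the default built-in multiclass objective (which applies a softmax across the per-class raw scores). The softmax function below is written here for illustration and is not a LightGBM API. The numbers are copied from the printed output above.

```python
import math

# Leaf values for class 0 from trees 0, 5 and 10 (copied from the table above).
leaf_values = [-1.460720, 0.119902, 0.098720]
raw_class_0 = sum(leaf_values)  # approximately -1.242098

# Raw scores for the first row, one per class (copied from the output above).
raw_scores = [-1.24209749, -1.90204682, -1.9020346, -1.89711144, -1.23250193]


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax(raw_scores)

# The predicted class is the one with the highest probability,
# which is also the one with the highest raw score.
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(predicted_class)  # 4
```

If that assumption holds for your model, probs should agree with clf.predict_proba(X)[0] up to floating-point precision.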
