DecisionTreeClassifier does not choose the optimal split


I seem to misunderstand something about sklearn's DecisionTreeClassifier. Consider the following code:

from sklearn.tree import DecisionTreeClassifier, plot_tree
from scipy.stats import entropy
import numpy as np

train_x = np.array([[-0.62144528,  0.37728335],
 [-0.46795808,  0.2464509 ],
 [-0.31221227, -0.61933418],
 [-0.37111143, -0.37863888],
 [-0.4473217,  -0.15192771],
 [-0.55939442,  0.59526016],
 [-0.37522823, -0.2779457 ],
 [-0.39228952, -0.37050653],
 [-0.43533553, -0.02755128],
 [-0.45524276,  0.39507087],
 [-0.50147608,  0.58797464],
 [-0.47677197, -0.64571978],
 [-0.41001417, -0.16771494],
 [-0.39795968, -0.27224625],
 [-0.46032929,  0.24087007],
 [-0.50722624,  0.51068014],
 [-0.44299732,  0.00477296],
 [-0.37282845, -0.68609962],
 [-0.40829113, -0.26251665],
 [-0.46950366,  0.14817891],
 [-0.58785758,  0.25280204],
 [-0.45326652,  0.0034019 ],
 [-0.41441818,  0.14027937]])

train_y = np.array([[ 7], [ 2], [12], [11], [ 4], [ 7], [10], [11], [ 1], [ 3], [ 7], [10], [ 1], [ 8], [ 3], [ 2], [ 4], [ 5], [ 8], [ 4], [ 7], [ 4], [ 1]])

clf = DecisionTreeClassifier(random_state=0, criterion="entropy", max_depth=1)
clf = clf.fit(train_x, train_y)

values_root, counts_root = np.unique(train_y, return_counts=True)
counts_root = counts_root / len(train_x)
entropy_root = entropy(counts_root, base=2)

train_left = train_y[train_x[:, 1] <= -0.65]
train_right = train_y[train_x[:, 1] > -0.65]

values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)

values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)

entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)

print('Information gain manual split: ', entropy_root - (entropy_1 + entropy_2) / 2)

train_left = train_y[train_x[:, 1] <= clf.tree_.threshold[0]]
train_right = train_y[train_x[:, 1] > clf.tree_.threshold[0]]

values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)

values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)

entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)

print('Information gain sklearn split: ', entropy_root - (entropy_1 + entropy_2) / 2)

plot_tree(clf)

The best split found by DecisionTreeClassifier is y <= -0.215:

This gives an information gain of 3.186 - (2.25 + 2.257) / 2 = 0.933. However, as the code shows, splitting y at -0.65 yields an IG of 3.186 - (0 + 3.061) / 2 = 1.656. Why doesn't it find a split like this one with a higher IG? min_samples_leaf is at its default of 1, so that is not the reason. What mistake am I making?

python scikit-learn decision-tree
1 Answer

The improvement is measured using the average child impurity weighted by the number of samples in each node, not the unweighted average. See, e.g., the equation for $G(Q_m, \theta)$ in this section of the User Guide.

So the impurity after the decision tree's split is

2.25 * 8/23 + 2.257 * 15/23 ≈ 2.2545

while your candidate split gives

0 * 1/23 + 3.061 * 22/23 ≈ 2.9279
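The comparison above can be checked with a small sketch (the helper name is made up for illustration; the child sizes and entropies are the ones from the question, 23 samples total):

```python
def weighted_impurity(sizes, entropies):
    """Average child entropy, weighted by the fraction of samples per child."""
    n = sum(sizes)
    return sum(s / n * h for s, h in zip(sizes, entropies))

# sklearn's split y <= -0.215: children of size 8 and 15, entropies 2.25 and 2.257
print(weighted_impurity((8, 15), (2.25, 2.257)))    # ~2.2546
# candidate split y <= -0.65: children of size 1 and 22, entropies 0 and 3.061
print(weighted_impurity((1, 22), (0.0, 3.061)))     # ~2.9279
```

With sample weighting, sklearn's split has the lower impurity, which is why it is chosen over the candidate.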
