I have some misunderstanding about sklearn's DecisionTreeClassifier. For the following code:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from scipy.stats import entropy
import numpy as np
train_x = np.array([[-0.62144528, 0.37728335],
[-0.46795808, 0.2464509 ],
[-0.31221227, -0.61933418],
[-0.37111143, -0.37863888],
[-0.4473217, -0.15192771],
[-0.55939442, 0.59526016],
[-0.37522823, -0.2779457 ],
[-0.39228952, -0.37050653],
[-0.43533553, -0.02755128],
[-0.45524276, 0.39507087],
[-0.50147608, 0.58797464],
[-0.47677197, -0.64571978],
[-0.41001417, -0.16771494],
[-0.39795968, -0.27224625],
[-0.46032929, 0.24087007],
[-0.50722624, 0.51068014],
[-0.44299732, 0.00477296],
[-0.37282845, -0.68609962],
[-0.40829113, -0.26251665],
[-0.46950366, 0.14817891],
[-0.58785758, 0.25280204],
[-0.45326652, 0.0034019 ],
[-0.41441818, 0.14027937]])
train_y = np.array([[ 7], [ 2], [12], [11], [ 4], [ 7], [10], [11], [ 1], [ 3], [ 7], [10], [ 1], [ 8], [ 3], [ 2], [ 4], [ 5], [ 8], [ 4], [ 7], [ 4], [ 1]])
clf = DecisionTreeClassifier(random_state=0, criterion="entropy", max_depth=1)
clf = clf.fit(train_x, train_y)
values_root, counts_root = np.unique(train_y, return_counts=True)
counts_root = counts_root / len(train_x)
entropy_root = entropy(counts_root, base=2)
train_left = train_y[train_x[:, 1] <= -0.65]
train_right = train_y[train_x[:, 1] > -0.65]
values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)
values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)
entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)
print('Information gain manual split: ', entropy_root - (entropy_1 + entropy_2) / 2)
train_left = train_y[train_x[:, 1] <= clf.tree_.threshold[0]]
train_right = train_y[train_x[:, 1] > clf.tree_.threshold[0]]
values1, counts1 = np.unique(train_left, return_counts=True)
counts1 = counts1 / len(train_left)
values2, counts2 = np.unique(train_right, return_counts=True)
counts2 = counts2 / len(train_right)
entropy_1 = entropy(counts1, base=2)
entropy_2 = entropy(counts2, base=2)
print('Information gain sklearn split: ', entropy_root - (entropy_1 + entropy_2) / 2)
plot_tree(clf)
The optimal split found by DecisionTreeClassifier is y <= -0.215:
This gives an information gain of 3.186 - (2.25 + 2.257) / 2 = 0.933. However, as the code shows, splitting y at -0.65 gives an IG of 3.186 - (0 + 3.061) / 2 = 1.656. Why doesn't it find a split like this one with higher IG? min_samples_leaf is set to its default value of 1, so that is not the reason. What mistake am I making?
The improvement is measured using the average child impurity weighted by the number of samples in each child node, not the plain average your code computes. See the equation for $G(Q_m, \theta) = \frac{n_m^{\text{left}}}{n_m} H(Q_m^{\text{left}}(\theta)) + \frac{n_m^{\text{right}}}{n_m} H(Q_m^{\text{right}}(\theta))$ in this section of the user guide.
So the impurity of the decision tree after sklearn's split is
2.25 * 8/23 + 2.257 * 15/23 ≈ 2.2546
while your candidate split gives
0 * 1/23 + 3.061 * 22/23 ≈ 2.9279
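The weighted criterion can be checked by hand; here is a minimal sketch (the weighted_impurity helper is my own illustration, not part of sklearn's API), reusing the child entropies and sample counts from the numbers above:

```python
import numpy as np
from scipy.stats import entropy

def weighted_impurity(y, x_col, threshold):
    """Average child entropy, weighted by each child's share of the samples,
    mirroring the G(Q_m, theta) criterion from the user guide."""
    left, right = y[x_col <= threshold], y[x_col > threshold]
    h = lambda s: entropy(np.unique(s, return_counts=True)[1] / len(s), base=2)
    n = len(y)
    return len(left) / n * h(left) + len(right) / n * h(right)

# A perfectly separating split drives the weighted impurity to zero:
weighted_impurity(np.array([0, 0, 1, 1]), np.array([0.0, 0.1, 0.9, 1.0]), 0.5)  # 0.0

# Plugging in the entropies and child sizes from the question
# (8 vs. 15 samples for sklearn's split at -0.215,
#  1 vs. 22 samples for the candidate split at -0.65):
impurity_sklearn = 8/23 * 2.25 + 15/23 * 2.257    # ≈ 2.2546
impurity_candidate = 1/23 * 0.0 + 22/23 * 3.061   # ≈ 2.9279
print(impurity_sklearn < impurity_candidate)       # True: sklearn's split wins
```

Under the weighted criterion the candidate split at -0.65 carves off a single sample, so its zero-entropy child barely counts (weight 1/23), and sklearn's split is correctly preferred.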