我正在寻找一种优雅的解决方案,使用在一个数据集中创建的决策规则(例如您的训练集)来根据这些规则分割另一个数据集(例如测试数据)的数据。
看看这个例子:
# Load PimaIndiansDiabetes dataset from mlbench package
library("mlbench")
data("PimaIndiansDiabetes")
## Split in training and test (2/3 - 1/3)
idtrain <- c(sample(1:768,512))
PimaTrain <-PimaIndiansDiabetes[idtrain,]
Pimatest <-PimaIndiansDiabetes[-idtrain,]
m1 <- RWeka::J48(as.factor(as.character(PimaTrain$diabetes)) ~ .,
data = PimaTrain[,-c(9)],
control = RWeka::Weka_control(M = 10, C= 0.25))
这给出了以下输出:
> m1
J48 pruned tree
------------------
glucose <= 154
| age <= 28
| | glucose <= 118: neg (157.0/11.0)
| | glucose > 118
| | | pressure <= 52: pos (10.0/3.0)
| | | pressure > 52: neg (54.0/12.0)
| age > 28
| | glucose <= 103: neg (54.0/10.0)
| | glucose > 103
| | | mass <= 41.3: neg (129.0/55.0)
| | | mass > 41.3: pos (12.0/1.0)
glucose > 154: pos (96.0/19.0)
Number of Leaves : 7
Size of the tree : 13
根据这些规则,您将有7个组(或叶子)。我正在寻找的是在测试数据Pimatest上应用这些规则(因此不重新训练决策树),以便实际上每个数据点可以被指定给用新变量组指示的7个组中的一个。
输出看起来像这样:
head(Pimatest)
pregnant glucose pressure triceps insulin mass pedigree age diabetes group
3 8 183 64 0 0 23.3 0.672 32 pos 7
4 1 89 66 23 94 28.1 0.167 21 neg 1
6 5 116 74 0 0 25.6 0.201 30 neg 5
7 3 78 50 32 88 31.0 0.248 26 pos 1
8 10 115 0 0 0 35.3 0.134 29 neg 5
11 4 110 92 0 0 37.6 0.191 30 neg 5
我目前有一个工作解决方案编码非常糟糕,所以这就是为什么我正在为这个问题寻找一个优雅的解决方案。
据我了解,您希望能够将每个点与对该点进行分类的规则集相关联。您可以通过将J48
树转换为party
树并使用partykit
包中的工具来实现目标。
因为您没有为随机数生成器设置种子,所以我们无法获得与您获得的完全相同的测试/训练分割。我将设置种子以使我的示例可重现,但即使我使用您的代码,我的树将与您的稍有不同。
可重复的示例(主要是您的代码)
library(RWeka)
library("mlbench")
data("PimaIndiansDiabetes")
## Split in training and test (2/3 - 1/3)
set.seed(1234)
idtrain <- c(sample(1:768,512))
PimaTrain <-PimaIndiansDiabetes[idtrain,]
Pimatest <-PimaIndiansDiabetes[-idtrain,]
m1 <- RWeka::J48(as.factor(as.character(PimaTrain$diabetes)) ~ .,
data = PimaTrain[,-c(9)],
control = RWeka::Weka_control(M = 10, C= 0.25))
m1
J48 pruned tree
------------------
glucose <= 122
| mass <= 26.8: neg (85.0/1.0)
| mass > 26.8
| | pregnant <= 4: neg (137.0/19.0)
| | pregnant > 4
| | | glucose <= 106: neg (44.0/10.0)
| | | glucose > 106: pos (24.0/6.0)
glucose > 122
| glucose <= 157
| | age <= 31
| | | age <= 24: neg (30.0/5.0)
| | | age > 24
| | | | pressure <= 72: pos (16.0/5.0)
| | | | pressure > 72: neg (22.0/5.0)
| | age > 31: pos (78.0/27.0)
| glucose > 157: pos (76.0/13.0)
Number of Leaves : 9
Size of the tree : 17
我的树有9片而不是你的7片。这是由于为训练集选择了不同的实例。现在我们准备好了解规则。
library(partykit)
Pm1 = as.party(m1)
Pm1
Fitted party:
[1] root
| [2] glucose <= 122
| | [3] mass <= 26.8: neg (n = 85, err = 1.2%)
| | [4] mass > 26.8
| | | [5] pregnant <= 4: neg (n = 137, err = 13.9%)
| | | [6] pregnant > 4
| | | | [7] glucose <= 106: neg (n = 44, err = 22.7%)
| | | | [8] glucose > 106: pos (n = 24, err = 25.0%)
| [9] glucose > 122
| | [10] glucose <= 157
| | | [11] age <= 31
| | | | [12] age <= 24: neg (n = 30, err = 16.7%)
| | | | [13] age > 24
| | | | | [14] pressure <= 72: pos (n = 16, err = 31.2%)
| | | | | [15] pressure > 72: neg (n = 22, err = 22.7%)
| | | [16] age > 31: pos (n = 78, err = 34.6%)
| | [17] glucose > 157: pos (n = 76, err = 17.1%)
Number of inner nodes: 8
Number of terminal nodes: 9
这与以前的树相同,但具有标记节点的优点。我们还可以为每个叶子写出规则。
Pm1_rules = partykit:::.list.rules.party(Pm1)
Pm1_rules
3
"glucose <= 122 & mass <= 26.8"
5
"glucose <= 122 & mass > 26.8 & pregnant <= 4"
7
"glucose <= 122 & mass > 26.8 & pregnant > 4 & glucose <= 106"
8
"glucose <= 122 & mass > 26.8 & pregnant > 4 & glucose > 106"
12
"glucose > 122 & glucose <= 157 & age <= 31 & age <= 24"
14
"glucose > 122 & glucose <= 157 & age <= 31 & age > 24 & pressure <= 72"
15
"glucose > 122 & glucose <= 157 & age <= 31 & age > 24 & pressure > 72"
16
"glucose > 122 & glucose <= 157 & age > 31"
17
"glucose > 122 & glucose > 157"
决定是作为规则写出来的。规则集的名称是叶节点的编号。要获得用于测试点的规则,您只需要知道它最终到达哪个叶节点。但是派对对象的predict
方法会给你这个。
TestPred = predict(Pm1, newdata=Pimatest, type="node")
TestPred
3 4 5 6 9 12 17 20 22 27 28 29 31 32 33 35 36 38 41 43
17 5 16 3 17 17 5 5 7 16 3 16 8 17 3 8 3 7 17 3
46 48 50 56 57 60 62 64 65 66 68 70 72 75 76 79 84 95 96 97
17 5 3 3 17 5 16 12 8 7 5 15 14 5 3 14 3 12 16 5
...
我截断了输出,因为它太长了。现在,例如,
我们看到第一个测试点转到了节点17.我们只需要使用它来索引规则集。但需要一点点关心。 predict
返回的17是一个数字。规则集的名称是一个字符串,因此我们需要使用as.character
进行转换。
Pm1_rules[as.character(TestPred[1])]
17
"glucose > 122 & glucose > 157"
我们确认:
Pimatest[1,]
pregnant glucose pressure triceps insulin mass pedigree age diabetes
3 8 183 64 0 0 23.3 0.672 32 pos
所以,是的,glucose > 122
和glucose > 157
您可以以相同的方式获取其他测试点的规则。