Why do different random forest implementations in R produce different results?


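I admit this is a somewhat difficult question to put to anyone other than the people who wrote the packages, but I am getting consistently different results from three different implementations of random forests in R.

The three methods in question are the randomForest package, the "rf" method in caret, and the ranger package. The code is included below.

The data in question is just one example; I see similar things in other specifications with similar data.

The LHS variable is party identification (Dem, Rep, Indep.). The right-hand-side predictors are demographics. To try to figure out what was going on with some bizarre results in the randomForest package, I implemented the same model with the other two methods. They do not reproduce that particular anomaly, which is especially strange because, as far as I can tell, the rf method in caret just uses the randomForest package indirectly.

The three specifications I run in each implementation are (1) a three-category classification, (2) the same model with the Independent category dropped, and (3) the same as 2 but with a single observation scrambled to "Independent" to keep three categories in the model, which should produce results similar to 2. As far as I can tell, there should be no over- or under-sampling in any of these cases that would explain the results.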

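I also notice the following trends:

  1. The randomForest package is the only one that goes completely haywire with only two categories.
  2. The ranger package consistently identifies more observations (both correctly and incorrectly) as Independent.
  3. The ranger package is always slightly worse in overall predictive accuracy.
  4. The caret package is similar to randomForest in overall accuracy (slightly higher), but is consistently better on the more common class and worse on the less common classes. This is strange because, as far as I can tell, I have not implemented any over- or under-sampling in either case, and because I believe caret relies on the randomForest package.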

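Below I include the code and the confusion matrices showing the differences in question. Rerunning the code produces similar trends in the confusion matrices every time; this is not a case of "any individual run can produce weird results".

Does anyone know why these packages consistently produce slightly different results in general (and, in the case of the linked problem with randomForest, very different ones) or, even better, why they differ in this particular way? For example, is there some kind of sample weighting or stratification inside these packages that I should be aware of?

Code: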

library(randomForest)
library(ranger)
library(caret)

# Settings shared by all models below
num_trees=1001
var_split=3

load("three_cat.Rda")
rf_three_cat  <-randomForest(party_id_3_cat~{RHS Vars},
                         data=three_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)

two_cat<-subset(three_cat,party_id_3_cat!="2. Independents")    
two_cat$party_id_3_cat<-droplevels(two_cat$party_id_3_cat)
rf_two_cat    <-randomForest(party_id_3_cat~{RHS Vars},
                         data=two_cat,
                         ntree=num_trees,
                         mtry=var_split,
                         type="classification",
                         importance=TRUE,confusion=TRUE)
scramble_independent<-subset(three_cat,party_id_3_cat!="2. Independents")
scramble_independent[1,19]<-"2. Independents"
scramble_independent<- data.frame(lapply(scramble_independent, as.factor), stringsAsFactors=TRUE)
rf_scramble<-randomForest(party_id_3_cat~{RHS Vars},
                      data=scramble_independent,
                      ntree=num_trees,
                      mtry=var_split,
                      type="classification",
                      importance=TRUE,confusion=TRUE)

ranger_2<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=two_cat,
             num.trees=num_trees,mtry=var_split)
ranger_3<-ranger(formula=party_id_3_cat~{RHS Vars},
             data=three_cat,
             num.trees=num_trees,mtry=var_split)
ranger_scram<-ranger(formula=party_id_3_cat~{RHS Vars},
                 data=scramble_independent,
                 num.trees=num_trees,mtry=var_split)

rfControl <- trainControl(method = "none", number = 1, repeats = 1)
rfGrid <- expand.grid(mtry = c(3))
rf_caret_3        <- train(party_id_3_cat~{RHS Vars},
                      data=three_cat,
                      method="rf", ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)
rf_caret_2        <- train(party_id_3_cat~{RHS Vars},
                data = two_cat,
                method = "rf",ntree=num_trees,
                type="classification",
                importance=TRUE,confusion=TRUE,
                trControl = rfControl, tuneGrid = rfGrid)
rf_caret_scramble <- train(party_id_3_cat~{RHS Vars},
                      data = scramble_independent,
                      method = "rf",ntree=num_trees,
                      type="classification",
                      importance=TRUE,confusion=TRUE,
                      trControl = rfControl, tuneGrid = rfGrid)

rf_three_cat$confusion
ranger_3$confusion.matrix
rf_caret_3$finalModel["confusion"]

rf_two_cat$confusion
ranger_2$confusion.matrix
rf_caret_2$finalModel["confusion"]

rf_scramble$confusion
ranger_scram$confusion.matrix
rf_caret_scramble$finalModel["confusion"]

Results (format slightly modified for comparison):

> rf_three_cat$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1121               3                                697   0.3844042
2. Independents                                                   263               7                                261   0.9868173
3. Republicans (including leaners)                                509               9                               1096   0.3209418                        

> ranger_3$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1128              46                                647   0.3805601
2. Independents                                                 263              23                                245   0.9566855
3. Republicans (including leaners)                              572              31                               1011   0.3736059

> rf_caret_3$finalModel["confusion"]
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1268               0                                553   0.3036793
2. Independents                                                   304               0                                227   1.0000000
3. Republicans (including leaners)                                606               0                               1008   0.3754647

> rf_two_cat$confusion
                                     1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                                 1775                                 46   0.0252608
3. Republicans (including leaners)                               1581                                 33   0.9795539

> ranger_2$confusion.matrix
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1154                                667   0.3662823
3. Republicans (including leaners)                              590                               1024   0.3655514

> rf_caret_2$finalModel["confusion"]
                                   1. Democrats (including leaners) 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315                                  506   0.2778693
3. Republicans (including leaners)                              666                                  948   0.4126394

> rf_scramble$confusion
                                     1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1104               0                                717   0.3937397
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              501               0                               1112   0.3106014

> ranger_scram$confusion.matrix
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1159               0                               662  0.3635365
2. Independents                                                   0               0                                 1  1.0000000
3. Republicans (including leaners)                              577               0                              1036  0.3577185

> rf_caret_scramble$finalModel["confusion"]
                                   1. Democrats (including leaners) 2. Independents 3. Republicans (including leaners) class.error
1. Democrats (including leaners)                               1315               0                                506   0.2778693
2. Independents                                                   0               0                                  1   1.0000000
3. Republicans (including leaners)                              666               0                                947   0.4128952
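
The overall accuracies behind trends 3 and 4 can be checked directly from the three-category matrices above. This is a minimal sketch in base R: the counts are copied from the output, the shortened class labels and the helpers `make_cm` and `overall_accuracy` are names introduced here for illustration, and rows are taken to be the observed class, which matches the reported class.error values.

```r
# Three-category confusion matrices copied from the output above
# (rows = observed class, columns = predicted class; class.error dropped).
classes <- c("Democrats", "Independents", "Republicans")

# Hypothetical helper: build a labelled 3 x 3 matrix from row-wise counts.
make_cm <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(observed = classes, predicted = classes))
}

cm <- list(
  randomForest = make_cm(c(1121,  3,  697,  263,  7, 261,  509,  9, 1096)),
  ranger       = make_cm(c(1128, 46,  647,  263, 23, 245,  572, 31, 1011)),
  caret_rf     = make_cm(c(1268,  0,  553,  304,  0, 227,  606,  0, 1008))
)

# Overall accuracy = correctly classified (diagonal) / all observations.
overall_accuracy <- function(m) sum(diag(m)) / sum(m)

accuracy <- sapply(cm, overall_accuracy)
round(accuracy, 3)
```

With these counts the accuracies come out at roughly 0.561 for randomForest, 0.545 for ranger and 0.574 for caret, on 3966 observations each.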
r machine-learning random-forest r-caret
1 answer

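First, random forest algorithms are... random, so there will be some variation by default. Second, and more importantly, the algorithms are different, i.e. they use different steps, which is why you get different results.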
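
To separate the run-to-run part of that variation from the between-package part, you can fix the seed. Here is a minimal base-R sketch that does not call any of the three packages: the helper `draw_bootstrap` is hypothetical and only mimics the row resampling a forest does once per tree.

```r
# Hypothetical helper: draw the row indices of one bootstrap sample,
# i.e. n rows sampled with replacement, as a forest does for each tree.
draw_bootstrap <- function(n) sample.int(n, size = n, replace = TRUE)

n_rows <- 3966  # number of observations implied by the confusion matrices

set.seed(42)
first_run <- draw_bootstrap(n_rows)

set.seed(42)
second_run <- draw_bootstrap(n_rows)

third_run <- draw_bootstrap(n_rows)  # no reseed: continues the RNG stream

same_seed_matches  <- identical(first_run, second_run)
other_state_matches <- identical(first_run, third_run)
c(same_seed = same_seed_matches, other_state = other_state_matches)
```

As far as I know, randomForest draws from R's own RNG, so calling `set.seed()` before the fit is enough, while ranger also accepts a `seed` argument. Differences that survive across seeds, like the ones in the question, point to the algorithms rather than to chance.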

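You should look at how they perform the splits (which criterion: Gini, extra trees, etc.) and whether the splits are randomized (extremely randomized trees), how they draw the bootstrap samples (with or without replacement) and in what proportion, how many variables are tried at each split in a node (mtry), the maximum depth or the maximum number of cases in a node, and so on.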
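
One way to work through that checklist is to write down the matching arguments side by side. The sketch below makes several assumptions: the argument names are the ones I understand `randomForest::randomForest()` and `ranger::ranger()` to document, the values are illustrative rather than the question's actual settings, and the `toy` data frame is made up. The second half uses base R only and illustrates one candidate explanation for the caret difference: as far as I know, caret's formula interface dummy-codes factors before handing the data to randomForest, so `mtry = 3` is drawn from a different set of columns.

```r
# Arguments that control the same thing under different names in the two
# packages. Values are illustrative assumptions, not tuned settings.
n_obs <- 3966  # rows in the three-category data, from the confusion matrices

rf_args <- list(
  ntree    = 1001,
  mtry     = 3,
  replace  = TRUE,   # bootstrap with replacement
  sampsize = n_obs,  # rows drawn per tree
  nodesize = 1       # minimum terminal node size
)

ranger_args <- list(
  num.trees       = 1001,
  mtry            = 3,
  replace         = TRUE,
  sample.fraction = 1,       # fraction of rows drawn per tree
  min.node.size   = 1,
  splitrule       = "gini",  # "extratrees" would give randomized splits
  respect.unordered.factors = "partition"  # try all 2-partitions of factor levels
)

# Toy data (hypothetical) to show how dummy coding changes the predictor count.
toy <- data.frame(
  party  = factor(c("Dem", "Rep", "Dem", "Rep")),
  region = factor(c("North", "South", "East", "West")),
  age    = c(34, 51, 27, 63)
)

# Predictors when factors are kept whole (one column per variable).
predictors_as_factors <- ncol(toy) - 1

# Predictors after dummy coding; subtract 1 for the intercept column.
predictors_as_dummies <- ncol(model.matrix(party ~ ., data = toy)) - 1

c(factors = predictors_as_factors, dummies = predictors_as_dummies)
```

Once the formula and data are added, either list can be spliced into the corresponding call with `do.call()`. If the predictor counts differ as they do in the toy example, the same `mtry` value does not mean the same thing in caret and in a direct randomForest call. Passing predictors to `train()` through `x` and `y` instead of a formula should avoid the dummy coding.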
