Random Forest: can mtry be larger than the total number of independent variables?

Question (votes: 2, answers: 1)

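1) I fitted a regression random forest to a 185-row dataset with 4 independent variables. Two of them are categorical, with 3 and 13 levels respectively. The other two are continuous numeric variables.

I ran the RF with 10-fold cross-validation, repeated 4 times. (I did not scale the dependent variable, which is why the RMSE is so large.)

My guess is that mtry can be greater than 4 because the categorical variables have 3 + 13 = 16 levels in total. But if that is the case, why doesn't the count include the numeric variables?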

185 samples
4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 168, 165, 166, 167, 166, 167, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE    
   2    16764183  0.7843863  9267902
   9     9451598  0.8615202  3977457
  16     9639984  0.8586409  3813891

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.

Please help me understand mtry.

2) Also, the sample sizes per fold are 168, 165, 166, ... Why does the sample size vary?

sample sizes: 168, 165, 166, 167, 166, 167

Thank you very much.

r random-forest r-caret interpretation
1 Answer
0 votes

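You are right that mtry can go up to 16: once the formula interface has expanded the factors into dummy columns, there are 16 columns available to sample from at each split, so the maximum value of mtry is 16.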
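One way to see where the 16 comes from is the base R sketch below. As far as I understand caret's formula method, factors are turned into dummy columns before the forest is fitted, so a 13-level factor contributes 12 columns and a 3-level factor contributes 2. Adding the two numeric predictors gives 12 + 2 + 2 = 16, so the numeric variables are counted after all. The data frame `toy` is made up to mirror the shapes described in the question; it is not the asker's data.

```r
# Made-up data with the same structure as in the question (an assumption):
# 185 rows, a 13-level factor, a 3-level factor, and two numeric columns.
toy <- data.frame(
  y  = seq_len(185) / 185,
  x1 = factor(rep(letters[1:13], length.out = 185)),
  x2 = factor(rep(LETTERS[1:3], length.out = 185)),
  x3 = rep(1:5, length.out = 185),
  x4 = seq(0, 1, length.out = 185)
)

# Expand the predictors with the default treatment contrasts,
# then drop the intercept column.
X <- model.matrix(y ~ ., data = toy)[, -1]

# Number of columns a tree could sample from:
# 12 dummies for x1 + 2 dummies for x2 + x3 + x4
n_columns <- ncol(X)
n_columns
```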

The values caret picks depend on two things. In train, there is a tuneLength option, which defaults to 3:

tuneLength = ifelse(trControl$method == "none", 1, 3)

This means it will test three values. For randomForest the tuning parameter is mtry, and its default grid is generated by:

getModelInfo("rf")[[1]]$grid
function(x, y, len = NULL, search = "grid"){
                    if(search == "grid") {
                      out <- data.frame(mtry = caret::var_seq(p = ncol(x), 
                                                              classification = is.factor(y), 
                                                              len = len))
                    } else {
                      out <- data.frame(mtry = unique(sample(1:ncol(x), size = len, replace = TRUE)))
                    }
                    out
                  }

Since you have 16 columns, this becomes:

var_seq(16,len=3)
[1]  2  9 16
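The same three numbers can be reproduced without caret. The sketch below assumes that, for a small number of columns, the grid is simply evenly spaced between 2 and p and rounded down; it is an illustration that matches the output above, not caret's source code.

```r
# Evenly spaced mtry candidates between 2 and p, rounded down.
# Assumption: this mirrors the default grid only for small p.
p   <- 16  # number of columns after dummy expansion
len <- 3   # tuneLength

candidates <- floor(seq(2, p, length.out = len))
candidates
```

Changing `len` gives a finer or coarser grid over the same range from 2 to p.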

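You can test mtry values of your own choosing by passing them through tuneGrid:

library(caret)
trCtrl = trainControl(method = "repeatedcv", repeats = 4, number = 10)
# we test mtry = 2, 4, 6, ..., 16
trg = data.frame(mtry = seq(2, 16, by = 2))
# some random data for the example
df = data.frame(y = rnorm(200),
                x1 = sample(letters[1:13], 200, replace = TRUE),
                x2 = sample(LETTERS[1:3], 200, replace = TRUE),
                x3 = rpois(200, 10),
                x4 = runif(200))

# fit (method = "rf" is train's default, written out here for clarity)
mdl = train(y ~ ., data = df, method = "rf", tuneGrid = trg, trControl = trCtrl)
# print the resampling results
mdl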

Random Forest 

200 samples
  4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE      
   2    1.120216  0.04448700  0.8978851
   4    1.157185  0.04424401  0.9275939
   6    1.172316  0.04902991  0.9371778
   8    1.186861  0.05276752  0.9485516
  10    1.193595  0.05490291  0.9543479
  12    1.200837  0.05608624  0.9574420
  14    1.205663  0.05374614  0.9621094
  16    1.210783  0.05537412  0.9665665

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
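On the second part of the question (the varying sample sizes), a minimal base R sketch may help. 185 rows cannot be split evenly into 10 folds, so the held-out folds differ in size, and so do the training sets reported in the summary. The round-robin assignment below is an assumption for illustration only. As far as I know, caret's own fold creation also balances the folds on the outcome, which would explain why sizes such as 165 and 168 can appear as well.

```r
# Illustration only: split 185 rows into 10 folds as evenly as possible.
n <- 185
k <- 10

# Round-robin fold assignment (an assumption, not caret's procedure).
fold_id <- rep(seq_len(k), length.out = n)

# Rows held out in each fold, and rows left for training.
held_out    <- as.vector(table(fold_id))
train_sizes <- n - held_out
train_sizes
```

In the example above with 200 rows, each fold holds out exactly 20 rows, which is why every training set has 180 rows.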