Problem extracting variable importance values from a random forest model (categorical variables)


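I am repurposing a random forest script written by a colleague to run 100 iterations of a model using spatial variables and the caret package, but the script was not originally written with categorical data in mind.

This is what my data looks like: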

head(sites)
           ID pres         X        Y     depot drainage   slope    elev      dtw
326670 326691    N -69.49958 47.19158 organique        2 2.53601 531.158  964.924
326847 326868    N -69.49959 47.19234 organique        2 5.52269 537.243  993.961
326848 326869    N -69.49849 47.19235 organique        2 3.69843 532.730 1071.810
327027 327048    N -69.49961 47.19309 glaciaire        4 6.70578 546.146 1028.870
327028 327049    N -69.49850 47.19310 organique        2 6.47242 540.644 1104.400
327029 327050    N -69.49739 47.19311 organique        2 3.19070 539.575 1179.800

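I am trying to predict "pres" (Y or N, a factor). Among the predictors, "depot" is a factor with 6 levels.

I created a matrix to store the variable importance from each run: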

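library(caret)         # train(), varImp(), createDataPartition(), preProcess()
library(randomForest)  # partialPlot()
library(data.table)    # data.table()

col_var_imp <- c("depotfluviatile", "depotfluvio-glaciaire", "depotglaciaire", "depotlacustre", "depotorganique", "depotpente", "drainage", "slope", "elev", "dtw")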

table.perf <- data.table(iteration = integer(), model = character(), n = integer(),
                         accuracy = numeric(), oob = numeric(), 
                         nn = integer(), ny = integer(),
                         yn = integer(), yy = integer())

varimps <- mat.or.vec(100, 10)
colnames(varimps) <- sort(col_var_imp)
varimps[varimps == 0] <- NA

The columns of the varimps object are based on a vector I created, which lists the categories of "depot" as individual "dummy" variables.
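To illustrate how a factor is typically expanded into dummy columns, here is a minimal base-R sketch. It assumes the default treatment contrasts, under which the first level of a factor becomes the reference level and gets no column of its own. The `toy` data frame and its values are hypothetical; only the six level names come from the vector above.

```r
# Hypothetical toy data using the six "depot" level names listed in the question
depot_levels <- c("fluviatile", "fluvio-glaciaire", "glaciaire",
                  "lacustre", "organique", "pente")
toy <- data.frame(
  depot = factor(rep(depot_levels, times = 2), levels = depot_levels),
  slope = seq(1, 12)
)

# Expand the predictors the way a formula interface usually does
mm <- model.matrix(~ ., data = toy)

# Drop the intercept and look at the remaining column names
dummy_cols <- setdiff(colnames(mm), "(Intercept)")
print(dummy_cols)
```

With 6 levels this yields 5 dummy columns plus `slope`; the reference level ("fluviatile" here) does not appear.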

for (m in c("RF")) {      # m = method used: RF, GLM
  for (n in c(87)) {       # n = number of pseudo-absences
    for (i in c(1:100)) {        # i = number of repetitions, we chose 100
      cat("Starting run", i,"\n")
      
      presences <- sites[sites$pres == "Y",]
      absences <- sites[sites$pres == "N",]
      absences <- absences[sample(c(1:dim(absences)[1]), n),]
      
      selected.sites <- rbind(presences, absences)
      
      #selected.sites[which(selected.sites == 0)] <- NA
      
      train.data <- createDataPartition(selected.sites$pres, p = .7, list = FALSE)
      
      training <- selected.sites[train.data,]
      testing <- selected.sites[-train.data,]
      
      preProcValues <- preProcess(training[, -c(1:4)], method = c("center", "scale"))
      
      training[, -c(1:4)] <- predict(preProcValues, training[, -c(1:4)])
      testing[, -c(1:4)] <- predict(preProcValues, testing[, -c(1:4)])
      
      # Check for near zero variables
      nzv <- nearZeroVar(training[, -c(1:4)], saveMetrics = TRUE) #if there were some we would have to script in to remove them
      
      if (m == "RF") {
        fitControl <- trainControl(
          method = "cv",
          number = 10,
          classProbs = T,
          summaryFunction = twoClassSummary
        )
        modelFit <- train(
          form = pres ~ .,
          data = training[, -c(1,3,4)],
          method = "rf",
          #tuneGrid = expand.grid(mtry = 2),
          family = "binomial",
          trControl = fitControl,
          metric = "ROC",
          model = FALSE
        )
        
        df <- data.frame(varImp(modelFit)$importance)
        df$Var <- row.names(df)
        r <- df[sort(df$Var),]
        varimps[i,] <- c(r$Overall, rep(NA, 10 - length(r$Overall)))

        names <- rownames(r)
        png(paste0(out.png.dir, "pp", m, "_", n, "_", i, ".png"), units="in", width=11, height=11, res=300)
        par(mfrow = c(4, 4), xpd = NA) #one page only (overwrites) so adjust number according to N variables
        for (name in names) {
          pp <- partialPlot(modelFit$finalModel, training, eval(name), main=name, xlab=name)
        }
        dev.off()
      }
    }
  }
}

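The script stops partway through run 1 and displays this error message:

Error in `[.data.frame`(pred.data, , xname) : colonnes non définies sélectionnées

(The session uses a French locale; the English wording of this R message is "undefined columns selected".)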
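One possible reading of this error, sketched below with base R only: the call in the message indexes `pred.data` by a variable name, and the names taken from the importance table are dummy names such as "depotorganique", while the `training` data frame passed to `partialPlot()` holds a single `depot` column. The `training_toy` data frame and its values are hypothetical, and expanding the predictors with `model.matrix()` is an assumption about what a matching `pred.data` could look like, not a confirmed fix.

```r
# Hypothetical toy version of the training data
training_toy <- data.frame(
  pres  = factor(c("Y", "N", "Y", "N", "Y", "N")),
  depot = factor(c("organique", "glaciaire", "organique",
                   "lacustre", "glaciaire", "lacustre")),
  slope = c(2.5, 6.7, 3.7, 5.5, 6.5, 3.2)
)

dummy_name <- "depotorganique"

# The raw data frame has no column with the dummy name
has_raw_column <- dummy_name %in% colnames(training_toy)

# Selecting it reproduces an "undefined columns selected" style error
err <- tryCatch(training_toy[, dummy_name],
                error = function(e) conditionMessage(e))

# Expanding the predictors creates columns whose names match the dummy names
x_expanded <- as.data.frame(
  model.matrix(pres ~ ., data = training_toy)[, -1, drop = FALSE]
)
has_expanded_column <- dummy_name %in% colnames(x_expanded)
```

If this reading is right, the partial dependence loop would need either a data frame expanded in this way or a list of names restricted to columns that actually exist in the data passed in.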

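I tried playing with this line: varimps[i,] <- c(r$Overall, rep(NA, 10 - length(r$Overall))), but it seems to be missing 1 variable.

The object "df", which is supposed to extract the variable importance from the model "modelFit", only contains 9 rows; it omits one level of the categorical variable "depot".

I don't understand why this happens, or how exactly the line mentioned above, with its rep function, works.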
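As a sketch of what the `rep()` line does: `rep(NA, 10 - length(r$Overall))` builds as many `NA` values as are needed to pad the importance vector to length 10, and the padded vector is then assigned by position. With 9 values, the single `NA` lands in the last column and the other values can end up under the wrong column names. Assigning by name avoids that. The `imp` data frame below is a hypothetical stand-in for `varImp(modelFit)$importance` with made-up values; it is not output from the actual model.

```r
col_var_imp <- c("depotfluviatile", "depotfluvio-glaciaire", "depotglaciaire",
                 "depotlacustre", "depotorganique", "depotpente",
                 "drainage", "slope", "elev", "dtw")

# NA-filled matrix with named columns (same shape as in the question)
varimps <- matrix(NA_real_, nrow = 100, ncol = length(col_var_imp),
                  dimnames = list(NULL, sort(col_var_imp)))

# Hypothetical importance table with 9 rows: the reference level is absent
imp <- data.frame(
  Overall = c(10, 20, 30, 40, 50, 60, 70, 80, 100),
  row.names = c("depotfluvio-glaciaire", "depotglaciaire", "depotlacustre",
                "depotorganique", "depotpente", "drainage", "slope",
                "elev", "dtw")
)

# What the padding does: 9 values followed by one trailing NA
padded <- c(imp$Overall, rep(NA, 10 - length(imp$Overall)))

# Name-based assignment: each value goes to the column with the same name,
# and the column of the missing level simply stays NA
i <- 1
varimps[i, rownames(imp)] <- imp$Overall
```

With this approach the row keeps `NA` for "depotfluviatile" and every other importance value sits under its own name, regardless of how the rows of the importance table are ordered.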

r model random-forest r-caret categorical-data