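I am repurposing a random forest script written by a colleague to run 100 iterations of a model using spatial variables and the caret package, but the script was not originally written with categorical data in mind.
Here is what my data looks like: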
head(sites)
ID pres X Y depot drainage slope elev dtw
326670 326691 N -69.49958 47.19158 organique 2 2.53601 531.158 964.924
326847 326868 N -69.49959 47.19234 organique 2 5.52269 537.243 993.961
326848 326869 N -69.49849 47.19235 organique 2 3.69843 532.730 1071.810
327027 327048 N -69.49961 47.19309 glaciaire 4 6.70578 546.146 1028.870
327028 327049 N -69.49850 47.19310 organique 2 6.47242 540.644 1104.400
327029 327050 N -69.49739 47.19311 organique 2 3.19070 539.575 1179.800
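I am trying to predict "pres" (Y or N, a factor). Among the predictors, "depot" is a factor with 6 levels.
I created a matrix to store the variable importance from each run: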
library(caret)        # train(), createDataPartition(), preProcess(), varImp()
library(randomForest) # partialPlot()
library(data.table)   # data.table()

col_var_imp <- c("depotfluviatile", "depotfluvio-glaciaire", "depotglaciaire", "depotlacustre", "depotorganique", "depotpente", "drainage", "slope", "elev", "dtw")
table.perf <- data.table(iteration = integer(), model = character(), n = integer(),
                         accuracy = numeric(), oob = numeric(),
                         nn = integer(), ny = integer(),
                         yn = integer(), yy = integer())
varimps <- mat.or.vec(100, 10)        # 100 runs x 10 expected variables
colnames(varimps) <- sort(col_var_imp)
varimps[varimps == 0] <- NA           # start with an all-NA matrix
The columns of the varimps object are based on a vector I created, which lists each category of "depot" as a separate dummy variable.
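As a cross-check on these hand-typed names, here is a minimal sketch of how base R's model.matrix expands a 6-level factor. It assumes that caret's formula interface (pres ~ .) builds its predictors the same way, with R's default treatment contrasts; the toy data frame and its values are made up for illustration.

```r
# Toy data mimicking the predictor columns of `sites` (values are made up).
depot_levels <- c("fluviatile", "fluvio-glaciaire", "glaciaire",
                  "lacustre", "organique", "pente")
toy <- data.frame(
  depot    = factor(rep(depot_levels, times = 2), levels = depot_levels),
  drainage = rep(1:4, times = 3),
  slope    = seq(1, 12),
  elev     = seq(500, 555, by = 5),
  dtw      = seq(900, 1450, by = 50)
)

# Default treatment contrasts: the first factor level is the baseline
# and does not get a column of its own.
mm <- model.matrix(~ ., data = toy)[, -1]  # drop the intercept column

dummy_names <- colnames(mm)
print(dummy_names)
print(length(dummy_names))
```

If that assumption holds, the first level ("fluviatile") acts as the baseline and gets no column, so a model fitted this way would report 5 "depot" dummies plus the 4 numeric predictors: 9 variables rather than 10.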
for (m in c("RF")) { # m = method used: RF, GLM
  for (n in c(87)) { # n = number of pseudo-absences
    for (i in c(1:100)) { # i = number of repetitions, we chose 100
      cat("Starting run", i, "\n")
      presences <- sites[sites$pres == "Y",]
      absences <- sites[sites$pres == "N",]
      absences <- absences[sample(c(1:dim(absences)[1]), n),]
      selected.sites <- rbind(presences, absences)
      #selected.sites[which(selected.sites == 0)] <- NA
      train.data <- createDataPartition(selected.sites$pres, p = .7, list = FALSE)
      training <- selected.sites[train.data,]
      testing <- selected.sites[-train.data,]
      preProcValues <- preProcess(training[, -c(1:4)], method = c("center", "scale"))
      training[, -c(1:4)] <- predict(preProcValues, training[, -c(1:4)])
      testing[, -c(1:4)] <- predict(preProcValues, testing[, -c(1:4)])
      # Check for near zero variables
      nzv <- nearZeroVar(training[, -c(1:4)], saveMetrics = TRUE) #if there were some we would have to script in to remove them
      if (m == "RF") {
        fitControl <- trainControl(
          method = "cv",
          number = 10,
          classProbs = TRUE,
          summaryFunction = twoClassSummary
        )
        modelFit <- train(
          form = pres ~ .,
          data = training[, -c(1,3,4)],
          method = "rf",
          #tuneGrid = expand.grid(mtry = 2),
          family = "binomial",
          trControl = fitControl,
          metric = "ROC",
          model = FALSE
        )
        df <- data.frame(varImp(modelFit)$importance)
        df$Var <- row.names(df)
        r <- df[sort(df$Var),]
        varimps[i,] <- c(r$Overall, rep(NA, 10 - length(r$Overall)))
        names <- rownames(r)
        png(paste0(out.png.dir, "pp", m, "_", n, "_", i, ".png"), units="in", width=11, height=11, res=300)
        par(mfrow = c(4, 4), xpd = NA) #one page only (overwrites) so adjust number according to N variables
        for (name in names) {
          pp <- partialPlot(modelFit$finalModel, training, eval(name), main=name, xlab=name)
        }
        dev.off()
      }
    } # end of i loop
  } # end of n loop
} # end of m loop
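The script stops partway through run 1 with this error message (my R session is in French; the message means "undefined columns selected"):
Error in `[.data.frame`(pred.data, , xname) :
colonnes non définies sélectionnées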
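The same kind of error can be imitated without randomForest. The sketch below assumes that the loop hands a dummy-variable name such as "depotglaciaire" (taken from the row names of the importance table) to partialPlot, while the data frame passed as pred.data only has a column called "depot". The toy data frame is hypothetical.

```r
# Hypothetical two-row data frame with the original factor column.
toy <- data.frame(depot = factor(c("glaciaire", "organique")),
                  slope = c(2.5, 5.5))

# "depotglaciaire" is the name of a dummy variable, not of a column in `toy`,
# so selecting it by name fails in `[.data.frame`.
msg <- tryCatch(
  toy[, "depotglaciaire"],
  error = function(e) conditionMessage(e)
)
print(msg)  # wording depends on the session's language

has_factor_column <- "depot" %in% names(toy)
has_dummy_column  <- "depotglaciaire" %in% names(toy)
print(c(has_factor_column, has_dummy_column))
```

Under that assumption, the plot loop would need names that actually exist in the data given to partialPlot, for instance the original column names rather than the dummy names.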
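I tried playing with this line: varimps[i,] <- c(r$Overall, rep(NA, 10 - length(r$Overall))), but one variable still seems to be missing.
The object "df", which should hold the variable importance extracted from the model "modelFit", contains only 9 rows: one level of the categorical variable "depot" is omitted.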
I don't understand why this happens, or how exactly the line above with the rep function works.
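For reference, a minimal sketch of what the rep line does, using made-up values in place of r$Overall and a 5-column toy matrix in place of varimps. rep(NA, k) returns k NA values, so the expression pads the importance vector on the right until it has as many entries as the matrix has columns. The second half shows a name-based assignment as one possible alternative; the names a to e are purely illustrative.

```r
overall <- c(10, 20, 30)  # stand-in for r$Overall (made-up values)
n_cols  <- 5              # stand-in for the 10 columns of varimps

# Positional padding: values fill from the left, NAs go at the end.
padded <- c(overall, rep(NA, n_cols - length(overall)))
print(padded)

# Name-based assignment: each value lands in the column with the same name,
# and columns without a value keep their NA.
varimps_toy <- matrix(NA_real_, nrow = 2, ncol = n_cols,
                      dimnames = list(NULL, c("a", "b", "c", "d", "e")))
imp <- c(b = 10, d = 20, e = 30)
varimps_toy[1, names(imp)] <- imp
print(varimps_toy[1, ])
```

With positional padding, a missing variable does not leave a gap in its own column: every later value shifts one column to the left and the NA ends up in the last column. Assigning by name avoids that misalignment, provided the names in the importance table match the column names exactly.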