R random forest error: type of predictors in new data do not match

Question

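I am trying to use the quantile regression forest function in R (quantregForest), which is built on the randomForest package. I am getting a type mismatch error and I can't figure out why.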

I train the model with

qrf <- quantregForest(x = xtrain, y = ytrain)

which works without problems, but when I try to predict on new data with

quant.newdata <- predict(qrf, newdata= xtest)

it gives the following error:

Error in predict.quantregForest(qrf, newdata = xtest) : 
Type of predictors in new data do not match types of the training data.

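My training and test data come from separate files (hence separate data frames) but have the same format. I have checked the classes of the predictors with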

sapply(xtrain, class)
sapply(xtest, class)

This is the output:

> sapply(xtrain, class)
    pred1     pred2     pred3     pred4     pred5     pred6     pred7     pred8 
 "factor" "integer" "integer" "integer"  "factor"  "factor" "integer"  "factor" 
    pred9    pred10    pred11    pred12 
 "factor"  "factor"  "factor"  "factor" 

> sapply(xtest, class)
    pred1     pred2     pred3     pred4     pred5     pred6     pred7     pred8 
 "factor" "integer" "integer" "integer"  "factor"  "factor" "integer"  "factor" 
    pred9    pred10    pred11    pred12 
 "factor"  "factor"  "factor"  "factor" 

They are exactly the same. I also checked for NA values: neither xtrain nor xtest contains any. Am I missing something trivial here?

Update I: Running the prediction on the training data still produces an error:

> quant.newdata <- predict(qrf, newdata = xtrain)
Error in predict.quantregForest(qrf, newdata = xtrain) : 
names of predictor variables do not match

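Update II: I combined my training and test sets into one file, so that rows 1 to 101 are the training data and the rest are the test data. I modified the example provided with quantregForest as follows: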

data <- read.table("toy.txt", header = TRUE)
n <- nrow(data)
indextrain <- 1:101
xtrain <- data[indextrain, 3:14]
xtest <- data[-indextrain, 3:14]
ytrain <- data[indextrain, 15]
ytest <- data[-indextrain, 15]

qrf <- quantregForest(x=xtrain, y=ytrain)
quant.newdata <- predict(qrf, newdata= xtest)

It works! I would appreciate it if someone could explain why it works this way and not the other way.

Answer 1

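I had the same problem. You can try a small trick to equalize the classes of the training and test sets: bind the first row of the training set to the test set, then delete it. For your example it should look like this: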

    xtest <- rbind(xtrain[1, ] , xtest)
    xtest <- xtest[-1,]
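
To see why the trick can help, here is a minimal sketch on made-up data (the data frames `train_df` and `test_df` are hypothetical, not the asker's files). It relies on `rbind()` for data frames expanding factor levels to cover both inputs; dropping the borrowed row afterwards leaves those levels in place.

```r
# Toy data: the test factor lacks level "c", which the training factor has
train_df <- data.frame(pred1 = factor(c("a", "b", "c")), pred2 = 1:3)
test_df  <- data.frame(pred1 = factor(c("a", "b")),      pred2 = 4:5)

levels(test_df$pred1)   # only "a" and "b" before the trick

# Bind the first training row on top, then drop it again
test_df <- rbind(train_df[1, ], test_df)
test_df <- test_df[-1, ]

# The test factor now carries the training levels as well
identical(levels(test_df$pred1), levels(train_df$pred1))
```

Note that this only lines the levels up when the test levels are a subset of the training levels; a level that occurs only in the test data would still be left over.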

Answer 2

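@mgoldwasser is right in general, but there is also a very nasty bug in predict.randomForest: even if you have exactly the same levels in the training and the prediction set, it is still possible to get this error. This can happen when you have a factor in which NA is embedded as its own level. The problem is that predict.randomForest essentially does the following: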

# Assume your original factor has two "proper" levels + NA level:
f <- factor(c(0,1,NA), exclude=NULL)

length(levels(f)) # => 3
levels(f)         # => "0" "1" NA

# Note that
sum(is.na(f))     # => 0
# i.e., the values of the factor are not `NA` only the corresponding level is.

# Internally predict.randomForest passes the factor (the one of the training set)
# through the function `factor(.)`.
# Unfortunately, it does _not_ do this for the prediction set.
# See what happens to f if we do that:
pf <- factor(f)

length(levels(pf)) # => 2
levels(pf)         # => "0" "1"

# In other words:
length(levels(f)) != length(levels(factor(f))) 
# => sad but TRUE

So it will always discard the NA level from the training set and will always see one additional level in the prediction set.

A workaround is to replace the NA value of the level before using randomForest:

levels(f)[is.na(levels(f))] <- "NA"
levels(f) # => "0"  "1"  "NA"
          #              .... note that this is no longer a plain `NA`

Now calling factor(f) no longer discards the level, and the check succeeds.
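
If several predictors are affected, the same relabeling could be applied to every factor column at once. The following is only a sketch: `relabel_na_levels` and the `toy` data frame are hypothetical names introduced for illustration, not part of randomForest or quantregForest.

```r
# Hypothetical helper: give NA levels an ordinary string label
# in every factor column of a data frame
relabel_na_levels <- function(df, label = "NA") {
  for (col in names(df)) {
    if (is.factor(df[[col]]) && anyNA(levels(df[[col]]))) {
      lv <- levels(df[[col]])
      lv[is.na(lv)] <- label
      levels(df[[col]]) <- lv
    }
  }
  df
}

# Toy data with NA embedded as its own level
toy <- data.frame(
  f = factor(c("0", "1", NA), exclude = NULL),
  x = c(1.5, 2.5, 3.5)
)

fixed <- relabel_na_levels(toy)
levels(fixed$f)            # the NA level is now the string "NA"
nlevels(factor(fixed$f))   # re-running factor() keeps all three levels
```

Under this approach the helper would be applied to both the training and the prediction data before fitting and predicting, so that both sides carry the same labels.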

Answer 3

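This happens because the factor variables in the training set and the test set have different levels (more precisely, the test set lacks some of the levels present in the training set). You can fix it by using the following code for each of your factor variables: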

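# Rebuild the factor against the training levels; values are matched by label,
# and any test value not seen in training becomes NA
test$SectionName <- factor(test$SectionName, levels = levels(train$SectionName))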
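
One caveat worth illustrating: `levels(x) <- value` renames levels by position, so when the test factor's levels are not simply the first few training levels, values can be silently relabeled. The sketch below uses toy vectors (`train_f` and `test_f` are made-up names) to contrast positional assignment with `factor(x, levels = ...)`, which matches by label.

```r
train_f <- factor(c("a", "b", "c"))
test_f  <- factor(c("a", "c", "c"))   # level "b" never occurs in the test data

# Positional assignment: the second test level ("c") is renamed to "b"
by_position <- test_f
levels(by_position) <- levels(train_f)
as.character(by_position)   # "a" "b" "b"

# Matching by label keeps the values and only adds the missing level
by_label <- factor(test_f, levels = levels(train_f))
as.character(by_label)      # "a" "c" "c"
levels(by_label)            # "a" "b" "c"
```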

Answer 4

Expanding on @user1849895's solution:

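# Align every factor column shared by both data frames with the training levels
common <- intersect(names(train), names(test))
for (p in common) {
  if (is.factor(train[[p]])) {
    # Match by label so that values are not relabeled by position
    test[[p]] <- factor(test[[p]], levels = levels(train[[p]]))
  }
}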

Answer 5

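This is a problem with the levels of each of the factors. You need to check that your factor levels stay consistent between your test and training sets.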

This is a weird quirk of random forest, and it doesn't make sense to me.
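
A check like this could be sketched as follows. Comparing `class()` alone, as in the question, does not reveal level differences, so this compares `levels()` too. The helper name `compare_predictors` and the toy data frames are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: names of shared columns whose class or
# factor levels differ between the two data frames
compare_predictors <- function(train, test) {
  shared <- intersect(names(train), names(test))
  mismatched <- character(0)
  for (col in shared) {
    same_class  <- identical(class(train[[col]]), class(test[[col]]))
    # levels() is NULL for non-factors, so those compare as equal here
    same_levels <- identical(levels(train[[col]]), levels(test[[col]]))
    if (!same_class || !same_levels) mismatched <- c(mismatched, col)
  }
  mismatched
}

# Toy data: same classes, but pred1 has different levels
train_df <- data.frame(pred1 = factor(c("a", "b", "c")), pred2 = 1:3)
test_df  <- data.frame(pred1 = factor(c("a", "b")),      pred2 = 4:5)

sapply(train_df, class)                # identical to sapply(test_df, class)
compare_predictors(train_df, test_df)  # flags "pred1"
```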

Answer 6

I just solved it by doing the following:

## Creating sample data
values_development=factor(c("a", "b", "c")) ## Values used when building the random forest model
values_production=factor(c("a", "b", "c", "ooops")) ## New values encountered when using the model

## Deleting cases which were not present when developing
values_production=sapply(as.character(values_production), function(x) if(x %in% values_development) x else NA)

## Creating the factor variable, (with the correct NA value level)
values_production=factor(values_production)

## Checking
values_production # =>  a     b     c  <NA>
