How do I produce a confusion matrix from cross-validation?

Question (votes: 1, answers: 2)

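I'm new to R and machine learning, and I'm working with data that has 2 classes. I'm trying to do cross-validation, but when I try to build the confusion matrix for my model I get an error saying that all arguments must have the same length. I can't understand why the inputs have different lengths. Any help in the right direction would be much appreciated.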

library(MASS)
xCV = x[sample(nrow(x)),]

folds <- cut(seq(1,nrow(xCV)),breaks=10,labels=FALSE)

for(i in 1:10){

  testIndexes = which(folds==i,arr.ind=TRUE)
  testData = xCV[testIndexes, ]
  trainData = xCV[-testIndexes, ]

}
ldamodel = lda(class ~ ., trainData)
lda.predCV = predict(model)

conf.LDA.CV=table(trainData$class, lda.predCV$class)
print(conf.LDA.CV)
r machine-learning cross-validation lda
2 Answers
5
votes

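The problem with your code is that the model fitting and the prediction are not done inside the loop. Each iteration overwrites testIndexes, testData and trainData, so after the loop you are left only with the split for i == 10.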
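To make the overwriting concrete, here is a minimal sketch on a toy index vector (the 20-row size is an assumption chosen for illustration, not taken from the question). It only shows that a variable assigned inside a `for` loop keeps the value from the final iteration.

```r
# Toy illustration: 20 hypothetical rows split into 10 consecutive folds
folds <- cut(seq_len(20), breaks = 10, labels = FALSE)

for (i in 1:10) {
  # Each pass replaces the previous value of testIndexes
  testIndexes <- which(folds == i)
}

# After the loop, only the indexes of the last fold remain
print(testIndexes)
print(identical(testIndexes, which(folds == 10)))
```

Anything that should happen once per fold, such as fitting the model and predicting, therefore has to sit inside the loop body.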

The following code does this on the iris data:

library(MASS)
data(iris)

Generate the folds:

set.seed(1)
folds <- sample(1:10, size = nrow(iris), replace = TRUE) # 10-fold CV, fold sizes vary
table(folds)
#output
folds
 1  2  3  4  5  6  7  8  9 10 
10 12 17 16 21 13 17 20 12 12

Or, if you want folds of equal size:

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(iris)), size = nrow(iris), replace = FALSE)
table(folds)
#output
folds
 1  2  3  4  5  6  7  8  9 10 
15 15 15 15 15 15 15 15 15 15 

Run the model by fitting on 9 folds and predicting on the held-out fold:

CV_lda <- lapply(1:10, function(x){ 
  model <- lda(Species ~ ., iris[folds != x, ])
  preds <- predict(model, iris[folds == x, ])$class
  return(data.frame(preds, real = iris$Species[folds == x]))
})

This produces a list of held-out predictions. Combine them into a single data frame:

CV_lda <- do.call(rbind, CV_lda)

Produce the confusion matrix:

library(caret)

confusionMatrix(CV_lda$preds, CV_lda$real)
#output
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         48         1
  virginica       0          2        49

Overall Statistics

               Accuracy : 0.98            
                 95% CI : (0.9427, 0.9959)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : < 2.2e-16       

                  Kappa : 0.97            
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.9600           0.9800
Specificity                 1.0000            0.9900           0.9800
Pos Pred Value              1.0000            0.9796           0.9608
Neg Pred Value              1.0000            0.9802           0.9899
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3200           0.3267
Detection Prevalence        0.3333            0.3267           0.3400
Balanced Accuracy           1.0000            0.9750           0.9800
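If caret is not available, a plain cross-tabulation gives the same counts. The following is a self-contained sketch of the steps above using only MASS and base R; the names `cv_preds`, `conf` and `accuracy` are illustrative, and the exact counts depend on the seed and the R version.

```r
library(MASS)  # provides lda()

set.seed(1)
# Equal-sized fold labels in random order
folds <- sample(rep(1:10, length.out = nrow(iris)))

# Fit on nine folds, predict the held-out fold, then stack all predictions
cv_preds <- do.call(rbind, lapply(1:10, function(k) {
  fit <- lda(Species ~ ., data = iris[folds != k, ])
  data.frame(
    preds = predict(fit, iris[folds == k, ])$class,
    real  = iris$Species[folds == k]
  )
}))

# Rows are predictions, columns are the true classes
conf <- table(pred = cv_preds$preds, actual = cv_preds$real)
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(accuracy)
```

Because every row is predicted exactly once while it is held out, the two vectors passed to `table()` have the same length, which avoids the error from the question.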

0
votes

Using the seeds dataset from hglm.data:

library(MASS)
data(seeds, package = "hglm.data")


seedsCV = seeds[sample(nrow(seeds)),]
folds <- cut(seq(1,nrow(seedsCV)),breaks=10,labels=FALSE)

lda.predCV <- integer(length(folds))

for(i in 1:10){

  testIndexes = which(folds==i,arr.ind=TRUE)
  testData = seedsCV[testIndexes, ]
  trainData = seedsCV[-testIndexes, ]

  ldamodel = lda(extract ~ ., trainData)

  lda.predCV[testIndexes] <- predict(ldamodel, testData)$class

}

lda.predCV <- factor(lda.predCV, labels = c("Bean", "Cucumber"))
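The loop above stores predictions in an integer vector, so what gets saved are the factor's integer codes, and the final `factor()` call maps them back to labels. A small sketch of that mechanism on toy data (the `truth` vector is made up for illustration):

```r
# Toy factor standing in for the predicted classes
truth <- factor(c("Bean", "Cucumber", "Bean", "Cucumber"))

# Assigning a factor into an integer vector keeps only the integer codes
codes <- integer(length(truth))
codes[1:2] <- truth[1:2]
codes[3:4] <- truth[3:4]
print(codes)

# Map the codes back to labels; passing levels explicitly keeps the mapping
# correct even if one class never appears among the predictions
restored <- factor(codes, levels = seq_along(levels(truth)),
                   labels = levels(truth))
print(restored)
```

Passing `levels` explicitly is a defensive choice here: with `labels` alone, the mapping relies on every class being predicted at least once.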

Print the confusion matrix and the accuracy:

conf <- table(pred=lda.predCV, actual=seedsCV$extract)
accuracy <- sum(diag(conf))/sum(conf)

> conf
          actual
pred       Bean Cucumber
  Bean       10        0
  Cucumber    0       11


> accuracy
[1] 1