PCA within cross-validation; but only for a subset of the variables

Question

This question is very similar to preprocess within cross-validation in caret; however, in the project I am working on I only want to run PCA on three of my 19 predictors. Here is the example from preprocess within cross-validation in caret; for convenience I will use the same data (PimaIndiansDiabetes) (this is not my project data, but the concept should be the same). I then want to preprocess only a subset of the variables, namely PimaIndiansDiabetes[, c(4, 5, 6)]. Is there a way to do this?

library(caret)
library(mlbench)
data(PimaIndiansDiabetes)

control <- trainControl(method="cv", 
                        number=5)
p <- preProcess(PimaIndiansDiabetes[, c(4, 5, 6)],  # only do these columns!
                method = c("center", "scale", "pca"))
p
grid <- expand.grid(mtry = c(1, 2, 3))

model <- train(diabetes~., data=PimaIndiansDiabetes, method="rf", 
               preProcess= p, 
               trControl=control,
               tuneGrid=grid)

But I get this error:

Error: pre-processing methods are limited to: BoxCox, YeoJohnson, expoTrans, invHyperbolicSine, center, scale, range, knnImpute, bagImpute, medianImpute, pca, ica, spatialSign, ignore, keep, remove, zv, nzv, conditionalX, corr

The reason I want to do this is that I could then reduce the three variables to a single PC1 and use that for prediction. In the project I am working on, the three variables are more than 90% correlated with each other, but since other studies have used all of them, I would like to combine them rather than drop any. Thank you. I am trying to avoid data leakage!

machine-learning pca cross-validation r-caret
1 Answer

As far as I know this is not possible with caret: the preProcess argument of train() only takes a character vector of method names, which are then applied to all numeric predictors, so you can neither pass a preProcess object (hence the error above) nor restrict it to a subset of columns. It might be possible with the recipes package; a hedged sketch is given at the end of this answer. However, I don't use recipes but mlr3, so I will show how to do it with that package:

library(mlr3)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
library(mlr3tuning)
library(mlbench)

Create a task from the data:

data("PimaIndiansDiabetes")

pima_tsk <- TaskClassif$new(id = "Pima",
                            backend = PimaIndiansDiabetes,
                            target = "diabetes")

Define a preprocessing selector named "slct1":

pos1 <- po("select", id = "slct1")

and define the selector function inside it:

pos1$param_set$values$selector <- selector_name(colnames(PimaIndiansDiabetes[, 4:6]))

Now define what happens to the selected features: scaling -> PCA that keeps only the first PC (param_vals = list(rank. = 1)):

pos1 %>>%
  po("scale", id = "scale1") %>>%
  po("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1

Now define an inverted selector for the remaining columns:

pos2 <- po("select", id = "slct2")

pos2$param_set$values$selector <- selector_invert(pos1$param_set$values$selector)

Define the learner:

rf_lrn <- po("learner", lrn("classif.ranger"))  # ranger is a fast implementation of random forest

Combine them:

gunion(list(pr1, pos2)) %>>%
  po("featureunion") %>>%
  rf_lrn -> graph

Check that everything is OK:

graph$plot(html = TRUE)

(plot of the pipeline graph: the slct1 -> scale1 -> pca1 branch and the slct2 branch are joined by featureunion and fed into classif.ranger)

Convert the graph into a learner:

glrn <- GraphLearner$new(graph)

Define the parameters to be tuned:

ps <-  ParamSet$new(list(
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))

Define the resampling:

cv10 <- rsmp("cv", folds = 10)

Define the tuning:

instance <- TuningInstance$new(
  task = pima_tsk,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)

set.seed(1)
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
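
Once tuning has finished, the selected hyperparameters can be written back into the graph learner and the learner refit on the full task. A hedged sketch, assuming the result list exposes the parameter values as instance$result$params (the field name differs between mlr3tuning versions):

# write the tuned values back into the graph learner and refit on all data
glrn$param_set$values <- instance$result$params
glrn$train(pima_tsk)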

For more details on how to tune the number of principal components to keep, check this answer: R caret: How do I apply separate pca to different dataframes before training?
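
As an illustration, a hedged sketch of how the number of retained components could be added to the search space; it assumes the PipeOp id pca1 from above, whose rank. parameter is exposed by the GraphLearner as pca1.rank.:

# tune the number of PCs kept (1-3) together with the ranger hyperparameters
ps_pcs <- ParamSet$new(list(
  ParamInt$new("pca1.rank.", lower = 1, upper = 3),
  ParamInt$new("classif.ranger.mtry", lower = 1, upper = 6),
  ParamInt$new("classif.ranger.num.trees", lower = 100, upper = 1000)))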

If you find this interesting, check out the mlr3book.

One more note: I checked the correlation of the three columns:

cor(PimaIndiansDiabetes[, 4:6])
          triceps   insulin      mass
triceps 1.0000000 0.4367826 0.3925732
insulin 0.4367826 1.0000000 0.1978591
mass    0.3925732 0.1978591 1.0000000

This does not show the >90% correlations mentioned in the question.
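
Finally, regarding the recipes route mentioned at the top of this answer: a minimal, untested sketch of how the subset-PCA could be expressed as a recipe and handed to caret::train(), assuming a caret version that accepts a recipe object (>= 6.0-79). The recipe steps are re-estimated inside every CV fold, so there is no leakage:

library(caret)
library(recipes)
library(mlbench)
data(PimaIndiansDiabetes)

# columns 4:6 are triceps, insulin and mass; center/scale them and
# replace them by their first principal component
rec <- recipe(diabetes ~ ., data = PimaIndiansDiabetes) %>%
  step_center(triceps, insulin, mass) %>%
  step_scale(triceps, insulin, mass) %>%
  step_pca(triceps, insulin, mass, num_comp = 1)

control <- trainControl(method = "cv", number = 5)

# train() can take a recipe in place of a formula
model <- train(rec, data = PimaIndiansDiabetes, method = "rf",
               trControl = control,
               tuneGrid = expand.grid(mtry = c(1, 2, 3)))

step_pca() drops triceps, insulin and mass and adds a single PC1 column; the other predictors are passed through unchanged.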
