R: knn + pca, undefined columns selected

Question

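I am trying to use knn for prediction, but I want to run a principal component analysis first to reduce the dimensionality.

However, after I generate the principal components and pass them to knn, I get the error

"Error in `[.data.frame`(data, , all.vars(Terms), drop = FALSE) : undefined columns selected"

as well as the warning:

"In addition: Warning message: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures."

Here is my sample data: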

sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
  data.frame()

The first 15 rows form the training set:

train1 = sample[1:15, ]
test = sample[16:20, ]

Remove the dependent variable before running the PCA:

pca.tr=sample[1:15,2:6]
pcom = prcomp(pca.tr, scale.=T)
pca.tr=data.frame(True=train1[,1], pcom$x)
#select the first 2 principal components
pca.tr = pca.tr[, 1:2]

train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)
k = train(train1[,1] ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:5),
          trControl  = train.control, preProcess='scale',
          metric     = "RMSE",
          data       = cbind(train1[,1], pca.tr))

Any suggestions are appreciated!

Answer 1

Use better column names and a formula without subscripts.

You should really try to post a reproducible example. Some of your code is wrong.
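As a rough illustration of why the subscripted formula fails, the base R sketch below inspects the variable names a formula carries. The data frame `dat_demo` and its values are hypothetical stand-ins, not the question's data, and the column lookup only mimics the `data[, all.vars(Terms), drop = FALSE]` step named in the error message.

```r
# Hypothetical stand-in for the data passed to train()
dat_demo <- data.frame(y = c(1, 2, 3), PC1 = c(0.1, -0.2, 0.1))

# Building a formula does not evaluate it, so train1 need not exist here
f_bad  <- train1[, 1] ~ PC1   # response written as a subscript expression
f_good <- y ~ PC1             # response written as a plain column name

vars_bad  <- all.vars(f_bad)   # contains "train1", which is not a column of dat_demo
vars_good <- all.vars(f_good)  # "y" and "PC1", both columns of dat_demo

# Selecting columns by the names from the bad formula raises an error
lookup_bad <- tryCatch(
  dat_demo[, vars_bad, drop = FALSE],
  error = function(e) conditionMessage(e)
)

# The same lookup with plain names works
lookup_good <- dat_demo[, vars_good, drop = FALSE]
```

Here `lookup_bad` ends up holding the error message as a string, while `lookup_good` is an ordinary two-column data frame.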

Also, the preProcess argument has a "pca" method that does the appropriate thing by recomputing the PCA scores inside each resample.

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(dplyr)  # provides %>%
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(55)
sample = cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)) %>%
  data.frame()

train1 = sample[1:15, ]
test = sample[16:20, ]

pca.tr = sample[1:15, 2:6]
pcom = prcomp(pca.tr, scale. = TRUE)
pca.tr = data.frame(True = train1[, 1], pcom$x)
# keep the first 2 columns: True (a copy of the outcome) and PC1
pca.tr = pca.tr[, 1:2]

dat <- cbind(train1[, 1], pca.tr) %>%
  # give every column a plain name so the formula needs no subscripts
  setNames(c("y", "True", "PC1"))

train.ct = trainControl(method = "repeatedcv", number = 3, repeats=1)

set.seed(356)
k = train(y ~ .,
          method     = "knn",
          tuneGrid   = expand.grid(k = 1:5),
          trControl  = train.ct, # this argument was wrong in your code
          preProcess='scale',
          metric     = "RMSE",
          data       = dat)
k
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  2 predictor
#> 
#> Pre-processing: scaled (2) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE      Rsquared   MAE     
#>   1  4.979826  0.4332661  3.998205
#>   2  5.347236  0.3970251  4.312809
#>   3  5.016606  0.5977683  3.939470
#>   4  4.504474  0.8060368  3.662623
#>   5  5.612582  0.5104171  4.500768
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 4.

# or let caret compute the PCA inside each resample
set.seed(356)
train(X1 ~ .,
      method     = "knn",
      tuneGrid   = expand.grid(k = 1:5),
      trControl  = train.ct, 
      preProcess= c('pca', 'scale'),
      metric     = "RMSE",
      data       = train1)
#> k-Nearest Neighbors 
#> 
#> 15 samples
#>  5 predictor
#> 
#> Pre-processing: principal component signal extraction (5), scaled
#>  (5), centered (5) 
#> Resampling: Cross-Validated (3 fold, repeated 1 times) 
#> Summary of sample sizes: 11, 10, 9 
#> Resampling results across tuning parameters:
#> 
#>   k  RMSE       Rsquared   MAE      
#>   1  13.373189  0.2450736  10.592047
#>   2  10.217517  0.2952671   7.973258
#>   3   9.030618  0.2727458   7.639545
#>   4   8.133807  0.1813067   6.445518
#>   5   8.083650  0.2771067   6.551053
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final value used for the model was k = 5.

Created on 2019-04-15 by the reprex package (v0.2.1)

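These results look worse in terms of RMSE, but the earlier run underestimated the RMSE because it assumed that the PCA scores do not vary across resamples.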
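To make the point about varying PCA scores concrete, here is a minimal base R sketch, assuming toy data shaped like the question's (one outcome plus five predictors). The names `toy`, `pc_fit`, and `fold_fit` are hypothetical. It fits the PCA on the training rows only, projects the held-out rows with `predict()`, and checks that refitting on a subset of rows (as happens inside a cross-validation fold) gives different loadings.

```r
set.seed(55)
# Hypothetical toy data: column X1 is the outcome, X2..X6 are predictors
toy <- data.frame(cbind(rnorm(20, 100, 10), matrix(rnorm(100, 10, 2), nrow = 20)))
train_x <- toy[1:15, 2:6]
test_x  <- toy[16:20, 2:6]

# Fit the PCA on the training rows only
pc_fit <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Project held-out rows using the training centering, scaling, and rotation
test_scores <- predict(pc_fit, newdata = test_x)[, 1:2]

# Refit on a subset of the training rows, as a resampling fold would
fold_fit <- prcomp(train_x[1:10, ], center = TRUE, scale. = TRUE)

# Compare loadings up to sign; they differ, so scores computed once on all
# training rows are not the scores a model would see inside each fold
rotation_changed <- !isTRUE(all.equal(abs(pc_fit$rotation), abs(fold_fit$rotation)))
```

Scores computed once from all 15 training rows have already seen the rows later held out in each fold, which is one way to read the optimistic RMSE of the first run.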
