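I'm trying to run a regression tree on a dataset using the train function. The dataset had numeric variables, which I converted to categorical ones in an attempt to resolve the error message. I also used the trainControl function to try to resolve the error. Please help!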
library(caret)
library(rpart)
library(mlbench)
data(Dataset)
set.seed(1)
ctrl <- trainControl(method = "cv", savePredictions = TRUE)
model_T <- train(VALUE~REF_DATE+Sex+`Age at admission`+`Years since admission`+`Income type`+Statistics+UOM, data = Dataset, method = 'rpart2', trControl = ctrl)
model_T
Structure of the dataset:
spec_tbl_df [46,464 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ REF_DATE : Factor w/ 11 levels "2006","2007",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Sex : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
$ Age at admission : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 4 ...
$ Years since admission: Factor w/ 11 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Income type : Factor w/ 6 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
$ Statistics : Factor w/ 4 levels "1","2","3","4": 3 3 3 3 3 3 3 3 3 3 ...
$ UOM : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 2 2 2 2 ...
$ VALUE : num [1:46464] 154640 145895 151290 155340 169745 ...
The problem is related to the spaces in the column names:
library(caret)
library(rpart)
library(mlbench)
ctrl <- trainControl(method = "cv",
savePredictions =TRUE)
model_T <- train(VALUE~REF_DATE+Sex+`Age at admission`+`Years since admission`+`Income type`+Statistics+UOM,
data = Dataset, method = 'rpart2', trControl = ctrl)
#Error in `[.data.frame`(m, labs) : undefined columns selected
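To confirm which columns are affected, one option is to compare each name with its make.names() version. This is only a minimal sketch: the toy data frame df is hypothetical and merely mimics the kind of column names found in Dataset.

```r
# Hypothetical toy data frame with the same style of column names as Dataset
df <- data.frame(
  VALUE = c(1.5, 2.5),
  `Age at admission` = factor(c("1", "2")),
  `Income type` = factor(c("3", "4")),
  check.names = FALSE  # keep the spaces instead of letting data.frame() fix them
)

# A name is non-syntactic if make.names() would change it
bad <- names(df)[names(df) != make.names(names(df))]
print(bad)
```

Any name listed in bad is a candidate for renaming before the formula is passed to train.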
If we use a dataset with clean names, i.e. with the spaces replaced by underscores and so on, it should work. Here we use clean_names from the janitor package to do that:
library(janitor)
Dataset2 <- clean_names(Dataset)
names(Dataset2)
#[1] "value" "ref_date" "sex" "age_at_admission" "years_since_admission" "income_type" "statistics" "uom"
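If installing janitor is not an option, a rough base R approximation of this renaming could look like the sketch below. It is an assumption-laden stand-in, not janitor's actual algorithm: it only lowercases the names and collapses runs of non-alphanumeric characters into a single underscore, and the vector old_names is a hypothetical subset of the column names.

```r
# Hypothetical subset of the original column names
old_names <- c("VALUE", "REF_DATE", "Age at admission", "Income type")

# Replace every run of non-alphanumeric characters with "_" and lowercase the result
approx_clean <- tolower(gsub("[^A-Za-z0-9]+", "_", old_names))
print(approx_clean)
```

For these particular names the result matches the clean_names output shown above, but that is not guaranteed for names with leading digits, accents, or duplicates.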
Now create the model:
model_T2 <- train(value~ref_date+sex+ age_at_admission+years_since_admission+income_type+statistics+uom,
data = Dataset2, method = 'rpart2', trControl = ctrl)
Output:
> model_T2
CART
200 samples
7 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
maxdepth RMSE Rsquared MAE
1 0.9669617 0.03721968 0.7642369
2 0.9674085 0.02626375 0.7656366
6 1.0268165 0.03139845 0.8033324
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was maxdepth = 1.
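Data used for the example above:
library(tibble)  # provides tibble(), used to build the example data
set.seed(123)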
Dataset <- tibble(VALUE = rnorm(200), REF_DATE = factor(rep(c(2006, 2007), each = 100)), Sex = factor(sample(1:4, size = 200, replace = TRUE)),
`Age at admission` = factor(sample(1:4, size = 200, replace = TRUE)),
`Years since admission` = factor(sample(1:11, size = 200, replace = TRUE)),
`Income type` = factor(sample(1:6, size = 200, replace = TRUE)),
Statistics = factor(sample(1:4, size = 200, replace = TRUE)),
UOM = factor(sample(1:2, size = 200, replace = TRUE))
)
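Apparently, some of the column names in your data contain spaces, which makes them syntactically invalid names in R. Also, note the ',', which is valid in data frames but not in model formulas.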
Besides the function and library suggested by akrun, you can also use the make.names() function from the base package, as follows:
names(Dataset) <- make.names(names(Dataset))
Once you fix the names, the error message will go away and your model will run.
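As an illustration of what make.names() does to the names in the question, here is a small sketch. The vector old_names is typed in by hand from the structure shown in the question, and building the formula with reformulate() is just one convenient way to avoid retyping the new names, not something the original posts prescribe.

```r
# Column names as listed in the question (typed in by hand for illustration)
old_names <- c("VALUE", "REF_DATE", "Sex", "Age at admission",
               "Years since admission", "Income type", "Statistics", "UOM")

# Spaces become dots; unique = TRUE also guards against duplicated names
new_names <- make.names(old_names, unique = TRUE)
print(new_names)

# Build VALUE ~ <all other columns> from the repaired names
f <- reformulate(setdiff(new_names, "VALUE"), response = "VALUE")
print(f)
```

The resulting formula can then be passed to train together with the renamed data.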