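I am trying to build a machine learning workflow using the caret and caretEnsemble packages in R. I have provided a sample dataset; my actual data is about 8-10 times larger. The code below is commented at the specific places where I am stuck. My question is how to:
Modify the entire workflow so that it is applied to each level of a factor, here the State variable in the sample dataset below.
Access the models and predictions for each State.
I assume a for loop is an option, but I am not sure how to code it so that I keep access to all of the models for further analysis. I also know of the nest function from the tidyverse, which I have used successfully for similar problems, but I am not sure how to use it with caret.
Most components of the workflow are complete except for the prediction part: I cannot get the model predictions, the confusion matrix, or the ROC summary to work. There are also some warnings(), which I think are a result of my sample dataset. Any help is appreciated.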
library(tidyverse)
library(caret)
library(caretEnsemble)
library(pROC)
library(glmnet)
library(e1071)
library(kernlab)
library(klaR)
download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
unzip(zipfile="2010_Carrier_PUF.zip")
sampleDat <- read.csv("~/2010_BSA_Carrier_PUF.csv")
names(sampleDat)=c("sex", "age", "diagnose", "healthcare.procedure",
"typeofservice", "service.count", "provider.type", "servicesprocessed",
"place.served", "payment", "carrierline.count")
set.seed(1)
sampleDat=sampleDat %>% mutate(sex=ifelse(sex==1,"Male","Female"),
State=sample(LETTERS[1:10], 2801660,
replace=TRUE,prob=c(0.35,0.16,0.1,0.09,0.06,0.04,0.05,0.05,0.05,0.05)),
cor_var=.2*payment,
near_zero_var=1) %>%
mutate_at(c("sex","age","provider.type","State"),list(as.factor)) %>%
dplyr::select(sex,age,provider.type,State,payment,carrierline.count,cor_var,near_zero_var)
# sample 5% from each State (done here only to keep the example manageable)
set.seed(2)
sampleDat=sampleDat %>% group_by(State)%>% sample_frac(.05,replace=FALSE)%>% ungroup()
y=sampleDat[,1]
x=sampleDat[,-1]
#preprocessing
#near zero variables
nzv <- nearZeroVar(x)
if(length(nzv)!=0){x=x[,-nzv ]}
## dummy Vars
xDummy <- dummyVars( ~ ., data = x)
x=as.data.frame(predict(xDummy, newdata = x))
highlyCorrelated <- findCorrelation(cor(x), cutoff=0.8)
# remove correlated variables
if(length(highlyCorrelated)!=0){x=x[,-highlyCorrelated ]}
#Training -Test split
combData=bind_cols(x,y)
#str(combData)
trainIndex <- createDataPartition(combData$sex, p = .8,
list = FALSE)
Train <- combData[ trainIndex,]
Test <- combData[-trainIndex,]
preProcValues <- preProcess(Train, method = c("range"))
Train = predict(preProcValues, Train)
Test = predict(preProcValues, Test)
## Model training and tuning
start_time=Sys.time()
set.seed(9)
fitControl <- trainControl(
method = "cv",
number = 2,
search="grid",
classProbs=TRUE,
savePredictions="final",
summaryFunction=twoClassSummary,
sampling = "down"
)
model_list <- caretList(
x=Train[,-25],y=Train[,25],
trControl=fitControl,
tuneList = list(NN=caretModelSpec(method="nnet",trace=FALSE),
GLM=caretModelSpec(method="glmnet",family="binomial",data=Train),
rf=caretModelSpec(method="rf",data=Train),
NB=caretModelSpec(method="nb")))
end_time=Sys.time()
end_time-start_time
#Time difference of 21.76968 mins
# Access models per State to evaluate models over resamples
## Prediction on test data
model_pred=predict(model_list, newdata=Test)
# Access model predictions per State to evaluate
# not sure how to get these results using the model_pred object
confusionMatrix(model_pred, reference=Test$sex,mode="everything")
twoClassSummary(Test, lev = levels(Test$sex))
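Here is an example that uses base R and caret to run a basic linear model on subsets of the mtcars data, using the number of cylinders as the split variable.
This example can be extended to meet the other requirements listed in the OP, such as comparing multiple models, predicting on a holdout dataset, and evaluating the accuracy of the holdout predictions.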
library(caret)
data(mtcars)
carsByCyl <- split(mtcars,mtcars$cyl)
modelList <- lapply(carsByCyl,function(x){
train(x[,3:9],x$mpg,method="lm")
# code to train additional models, select, predict to holdout, and
# evaluate holdout prediction accuracy would be added here
})
summary(modelList[[1]])
...and the output of summary():
> summary(modelList[[1]])
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Datsun.710      Merc.240D       Merc.230       Fiat.128    Honda.Civic
    -3.613e+00      1.595e+00      4.518e-01      2.618e+00     -3.690e-01
Toyota.Corolla  Toyota.Corona      Fiat.X1.9  Porsche.914.2   Lotus.Europa
     1.660e+00     -2.047e+00     -3.548e+00     -3.331e-15      2.738e+00
    Volvo.142E
     5.143e-01

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.12670   45.59393   0.858    0.454
disp         0.01707    0.25989   0.066    0.952
hp          -0.10459    0.11484  -0.911    0.430
drat        -3.71663    5.98606  -0.621    0.579
wt          -6.73335    9.86121  -0.683    0.544
qsec         1.26936    2.19622   0.578    0.604
vs          -2.53549    6.36063  -0.399    0.717
am           4.01277    9.34967   0.429    0.697

Residual standard error: 4.086 on 3 degrees of freedom
Multiple R-squared: 0.7537, Adjusted R-squared: 0.179
F-statistic: 1.312 on 7 and 3 DF, p-value: 0.45
>
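As a rough illustration of how the split-and-lapply pattern might be extended to a two-class problem like the one in the OP, here is a minimal sketch. It runs on simulated data rather than the CMS file, and the data frame dat, its columns x1 and x2, the helper fitOneState, and the list resultsByState are all hypothetical names chosen for this example. A single glm model stands in for the caretList of several models, purely to keep the run time short.

```r
library(caret)

# Simulated stand-in for the real data (all names here are hypothetical):
# a grouping factor, two numeric predictors, and a two-class outcome.
set.seed(1)
n <- 600
dat <- data.frame(
  State = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
dat$sex <- factor(ifelse(dat$x1 + 0.5 * dat$x2 + rnorm(n) > 0,
                         "Male", "Female"))

fitControl <- trainControl(method = "cv", number = 2)

# Run the same workflow on one State's rows and return every piece in a
# list, so models, predictions, and metrics stay accessible afterwards.
fitOneState <- function(d) {
  d$State <- NULL  # constant within a subset, so not useful as a predictor
  idx <- createDataPartition(d$sex, p = 0.8, list = FALSE)
  train_d <- d[idx, ]
  test_d <- d[-idx, ]
  fit <- train(sex ~ ., data = train_d, method = "glm",
               trControl = fitControl)
  pred <- predict(fit, newdata = test_d)  # factor of predicted classes
  list(
    model = fit,
    pred = pred,
    cm = confusionMatrix(pred, reference = test_d$sex, mode = "everything")
  )
}

set.seed(2)
resultsByState <- lapply(split(dat, dat$State), fitOneState)

# Everything is addressable by State name
resultsByState$A$model
resultsByState$A$cm
sapply(resultsByState, function(r) r$cm$overall["Accuracy"])
```

Because split() names the list elements after the factor levels, each State's model, holdout predictions, and confusion matrix can be pulled out by name after the loop finishes. Regarding the prediction step in the OP: confusionMatrix() expects a factor of predicted classes, and a caretList is a list of train objects, so something like lapply(model_list, predict, newdata = Test) should give one factor of class predictions per model, each of which could then be passed to confusionMatrix() in turn. The same helper could presumably be mapped over a nested tibble with tidyr::nest() and purrr::map() instead of split() and lapply(), although that variant is not shown here.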