Fitting multiple models with caret for each level of a factor in R


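I am trying to build a machine learning workflow in R using the caret and caretEnsemble packages. I have provided a sample dataset; my actual data is about 8-10 times larger. The code below has comments at the specific places where I am stuck. My question is how to: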

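  • Modify the whole workflow so that it is applied to each level of a factor, here the State variable in the sample dataset below.

    For each State:

    • Sample a percentage of the records
    • Perform preprocessing
    • Split into training and test sets
    • Fit 3-4 models and evaluate them
    • Predict and evaluate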

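I assume a for loop is an option, but I am not sure how to code it so that I keep access to all of the models for further analysis. I am also aware of the nest function in the tidyverse, which I have used successfully on similar problems, but I am not sure how to implement it with caret.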

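Most components of the workflow are working except for the prediction part: I cannot get predictions from the models, and I cannot build a confusion matrix or an ROC summary. I also get warnings(), which I think are a result of my sample dataset. Any help is appreciated.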

library(tidyverse)
library(caret)
library(caretEnsemble)
library(pROC)
library(glmnet)
library(e1071)
library(kernlab)
library(klaR)
download.file("http://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/BSAPUFS/Downloads/2010_Carrier_PUF.zip", "2010_Carrier_PUF.zip")
unzip(zipfile="2010_Carrier_PUF.zip")
sampleDat <- read.csv("~/2010_BSA_Carrier_PUF.csv")
names(sampleDat)=c("sex", "age", "diagnose", "healthcare.procedure",
  "typeofservice", "service.count", "provider.type", "servicesprocessed",
  "place.served", "payment", "carrierline.count")

set.seed(1)
sampleDat=sampleDat %>% mutate(sex=ifelse(sex==1,"Male","Female"),
                                State=sample(LETTERS[1:10], 2801660,
                                replace=TRUE,prob=c(0.35,0.16,0.1,0.09,0.06,0.04,0.05,0.05,0.05,0.05)),
                                cor_var=.2*payment,
                                near_zero_var=1) %>% 
                                mutate_at(c("sex","age","provider.type","State"),list(as.factor)) %>% 
                                dplyr::select(sex,age,provider.type,State,payment,carrierline.count,cor_var,near_zero_var) 


# sample 5% from each state, needed for general workflow but now for example  
set.seed(2)
sampleDat=sampleDat %>% group_by(State)%>% sample_frac(.05,replace=FALSE)%>% ungroup() 

y=sampleDat[,1]
x=sampleDat[,-1]

#preprocessing

#near zero variables
nzv <- nearZeroVar(x)
if(length(nzv)!=0){x=x[,-nzv ]}

## dummy Vars

xDummy <- dummyVars( ~ ., data = x)
x=as.data.frame(predict(xDummy, newdata = x))


highlyCorrelated <- findCorrelation(cor(x), cutoff=0.8)
# remove correlated variables
if(length(highlyCorrelated)!=0){x=x[,-highlyCorrelated ]}

#Training -Test split

combData=bind_cols(x,y)
#str(combData)
trainIndex <- createDataPartition(combData$sex, p = .8, 
                                  list = FALSE)

Train <- combData[ trainIndex,]
Test  <- combData[-trainIndex,]

preProcValues <- preProcess(Train, method = c("range"))

Train = predict(preProcValues, Train)
Test  = predict(preProcValues, Test)


## Model training and tuning

start_time=Sys.time()
set.seed(9)
fitControl <- trainControl(
  method = "cv",
  number = 2,
  search="grid",
  classProbs=TRUE,
  savePredictions="final",
  summaryFunction=twoClassSummary,
  sampling = "down"
 )

model_list <- caretList(
  x=Train[,-25],y=Train[,25], 
  trControl=fitControl,
  tuneList = list(NN=caretModelSpec(method="nnet",trace=FALSE),
                  GLM=caretModelSpec(method="glmnet",family="binomial",data=Train),
                  rf=caretModelSpec(method="rf",data=Train),
                  NB=caretModelSpec(method="nb")))

end_time=Sys.time()
end_time-start_time
#Time difference of 21.76968 mins

#  Access models per State to evaluate models over resamples 

## Prediction on test data

model_pred=predict(model_list, newdata=Test)
# Access model predictions per State to evaluate
# not sure how to  get these results using the model_pred object

confusionMatrix(model_pred, reference=Test$sex,mode="everything")
twoClassSummary(Test, lev = levels(Test$sex))
r machine-learning caret
1 Answer

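Here is an example that uses base R and caret to run a basic linear model on subsets of the mtcars data, using the number of cylinders as the split variable.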

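This example can be extended to meet the other requirements listed in the OP, such as comparing multiple models, predicting on a holdout dataset, and evaluating the accuracy of the holdout predictions.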

library(caret)
data(mtcars)
carsByCyl <- split(mtcars,mtcars$cyl)

modelList <- lapply(carsByCyl,function(x){
     train(x[,3:9],x$mpg,method="lm")

     # code to train additional models, select, predict to holdout, and 
     # evaluate holdout prediction accuracy would be added here
})

summary(modelList[[1]])

...and the output from summary():

> summary(modelList[[1]])

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
    Datsun.710      Merc.240D       Merc.230       Fiat.128    Honda.Civic 
    -3.613e+00      1.595e+00      4.518e-01      2.618e+00     -3.690e-01 
Toyota.Corolla  Toyota.Corona      Fiat.X1.9  Porsche.914.2   Lotus.Europa 
     1.660e+00     -2.047e+00     -3.548e+00     -3.331e-15      2.738e+00 
    Volvo.142E 
     5.143e-01 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.12670   45.59393   0.858    0.454
disp         0.01707    0.25989   0.066    0.952
hp          -0.10459    0.11484  -0.911    0.430
drat        -3.71663    5.98606  -0.621    0.579
wt          -6.73335    9.86121  -0.683    0.544
qsec         1.26936    2.19622   0.578    0.604
vs          -2.53549    6.36063  -0.399    0.717
am           4.01277    9.34967   0.429    0.697

Residual standard error: 4.086 on 3 degrees of freedom
Multiple R-squared:  0.7537,    Adjusted R-squared:  0.179 
F-statistic: 1.312 on 7 and 3 DF,  p-value: 0.45

> 
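
One way the split() / lapply() pattern might be extended to the classification workflow in the question is sketched below. It is only an illustration under stated assumptions: the data are simulated (State, x1, x2 and sex are made up here, not the CMS file), fitOneState is a hypothetical helper name, and glm and lda simply stand in for the models used in the question. Each State gets its own train/test split, its own fitted models, and its own confusion matrices, all kept in one nested list.

```r
library(caret)

# Simulated stand-in data (assumption: not the CMS dataset from the question)
set.seed(1)
n <- 600
dat <- data.frame(
  State = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  x1    = rnorm(n),
  x2    = rnorm(n)
)
dat$sex <- factor(ifelse(dat$x1 + rnorm(n) > 0, "Male", "Female"))

# Hypothetical helper: runs the whole workflow for the rows of one State
fitOneState <- function(d) {
  d$State <- NULL                      # constant within a group, so drop it

  # 80/20 split, stratified on the outcome
  idx      <- createDataPartition(d$sex, p = 0.8, list = FALSE)
  trainSet <- d[idx, ]
  testSet  <- d[-idx, ]

  ctrl <- trainControl(method = "cv", number = 2)

  # Fit several models on the same training data
  models <- list(
    glm = train(sex ~ ., data = trainSet, method = "glm", trControl = ctrl),
    lda = train(sex ~ ., data = trainSet, method = "lda", trControl = ctrl)
  )

  # Class predictions on the holdout rows, one factor per model
  preds <- lapply(models, predict, newdata = testSet)

  # One confusion matrix per model
  cms <- lapply(preds, confusionMatrix, reference = testSet$sex)

  list(models = models, test = testSet, predictions = preds, confusion = cms)
}

set.seed(2)
results <- lapply(split(dat, dat$State), fitOneState)

# Holdout accuracy: one row per model, one column per State
accuracy <- sapply(results, function(r) {
  sapply(r$confusion, function(cm) cm$overall[["Accuracy"]])
})
accuracy
```

Because everything is returned in a named list, individual pieces stay accessible afterwards, for example results$A$models$glm for a fitted model or results$B$confusion$lda for a confusion matrix. Preprocessing steps such as nearZeroVar() or preProcess() would presumably go inside the helper as well, so that they are estimated separately for each State.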