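After subsampling inside the resampling procedure, as shown at https://topepo.github.io/caret/subsampling-for-class-imbalances.html#subsampling-during-resampling, my question is how to extract the actual dataset produced by this process when the caret method = "rf" and the sampling method is "smote".
For example, with method = "glm" the data can be extracted with model$finalModel$data, and with method = "rpart" it can be extracted similarly with model$finalModel$call$data.
Using subsampling during resampling with method = "rpart", the SMOTE dataset can be extracted as follows: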
library(caret)
library(DMwR)
data("GermanCredit")
set.seed(122)
index1<-createDataPartition(GermanCredit$Class, p=.7, list = FALSE)
training<-GermanCredit[index1, ]
#testing<-GermanCredit[-index1,]
colnames(training)
metric <- "ROC"
ctrl1 <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  search = "random",
  classProbs = TRUE, # note class probabilities included
  savePredictions = TRUE, # "final"
  returnResamp = "final",
  allowParallel = TRUE,
  summaryFunction = twoClassSummary,
  sampling = "smote")
set.seed(1)
mod_fit <- train(Class ~ Age +
                   ForeignWorker +
                   Property.RealEstate +
                   Housing.Own +
                   CreditHistory.Critical,
                 data = training, method = "rpart",
                 metric = metric,
                 trControl = ctrl1)
mod_fit # ROC 0.5951215
dat_smote<- mod_fit$finalModel$call$data
table(dat_smote$.outcome)
# Bad Good
# 630 840
head(dat_smote)
# Age ForeignWorker Property.RealEstate Housing.Own CreditHistory.Critical .outcome
# 40 1 0 1 1 Good
# 29 1 0 0 0 Good
# 37 1 1 0 1 Good
# 47 1 0 0 0 Good
# 53 1 0 1 0 Good
# 29 1 0 1 0 Good
I would simply like to be able to do the same dataset extraction when method = "rf". The code might look something like dat <- mod_fit$trainingData[mod_fit$trainingData == mod_fit$finalModel$x, ]
I think the only way is to write a custom model that saves the data object in the fit module (although this is very unsatisfying).
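For what it's worth, the custom-model idea could be sketched roughly as below. This is a minimal sketch, not a confirmed caret recipe: the names `rf_keep` and `sampled_data` are made up for illustration, and it assumes that caret hands the already subsampled `x` and `y` to the fit module when `sampling` is set in `trainControl`, and that `last` is TRUE only for the final model fit.

```r
library(caret)

# Start from caret's built-in random forest model definition.
rf_keep <- getModelInfo("rf", regex = FALSE)[[1]]

# Replace the fit module with one that also stores the data it was given.
# Assumption: with sampling = "smote", x and y arrive here already subsampled.
rf_keep$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  fit <- randomForest::randomForest(x, y, mtry = min(param$mtry, ncol(x)), ...)
  if (last) {
    # Keep a copy only for the final model to avoid storing it for every resample.
    # "sampled_data" is a hypothetical element name, not part of randomForest.
    fit$sampled_data <- data.frame(x, .outcome = y)
  }
  fit
}

# Usage with the objects from the question (not run here):
# mod_rf <- train(Class ~ Age + ForeignWorker + Property.RealEstate +
#                   Housing.Own + CreditHistory.Critical,
#                 data = training, method = rf_keep,
#                 metric = metric, trControl = ctrl1)
# dat_smote <- mod_rf$finalModel$sampled_data
# table(dat_smote$.outcome)
```

Calling `randomForest::randomForest` directly inside the function, rather than wrapping the original fit module in a closure, is meant to keep the definition usable when `allowParallel = TRUE` sends it to worker processes.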