caret - Creating stratified data sets based on several variables

Question

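In the R package caret, can we use the function createDataPartition() (or createFolds() for cross-validation) to create stratified training and test sets based on several variables?

Here is an example for one variable: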

#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]

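In the code above, the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work, but I don't know if it is correct: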

inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)

Answer 1

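There is a better way to do this. Note that the second positional argument of createDataPartition() is 'times', so in the call above 'df$yourFactor2' is matched to 'times' rather than used as a second stratification variable.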

set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5,n,TRUE), 
                yourFactor2 = rbinom(n,1,.5),
                yourFactor3 = rbinom(n,1,.7))

Strata indicator

d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])

Sample selection

indices <- tapply(1:nrow(d), d$group, sample, 30 )

Obtain the subsample

subsampd <- d[unlist(indices, use.names = FALSE), ]

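This draws 30 random rows from each combination of yourFactor and yourFactor2, i.e. a stratified sample with a fixed size per stratum. Each combination must contain at least 30 rows, because sample() draws without replacement by default.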
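The answer above takes a fixed 30 rows per stratum, whereas the question asks for a proportional 2/3 training split. As a rough sketch of how the same interaction() idea could be adapted, the base R code below samples about 2/3 of the rows within each combination. The names `strata`, `rows_by_stratum`, `dTrain` and `dTest` are illustrative and not part of the original answer.

```r
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor  = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5))

# One stratum per observed combination of the two variables
strata <- interaction(d$yourFactor, d$yourFactor2, drop = TRUE)

# Within each stratum, keep roughly 2/3 of the row indices for training.
# sample.int() is used so that a stratum with a single row is handled safely.
rows_by_stratum <- split(seq_len(nrow(d)), strata)
inTrain <- unlist(lapply(rows_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * 2 / 3))]
}), use.names = FALSE)

dTrain <- d[inTrain, ]
dTest  <- d[-inTrain, ]

# Compare the share of each stratum in the two sets
round(prop.table(table(strata[inTrain])), 3)
round(prop.table(table(strata[-inTrain])), 3)
```

Since createDataPartition() samples within the levels of a factor, the same `strata` factor could presumably be passed as its first argument instead of `df$yourFactor` to get a comparable split from caret itself.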

Answer 2

If you use the tidyverse, this is straightforward.

For example:

library(dplyr)

df <- df %>%
  mutate(n = row_number()) %>% # create a row number if you don't have one
  select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>% # any number of variables you wish to partition by proportionally
  sample_frac(.7) # '.7' is the proportion of each group you wish to sample
test <- anti_join(df, train, by = "n") # test set: the observations not in 'train', matched on row number
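For readers on dplyr 1.0.0 or later, where sample_frac() is superseded by slice_sample(), the same idea can be written roughly as follows. This is a self-contained sketch on made-up data: the toy `df`, the `value` column and the `row_id` name are assumptions for illustration only.

```r
library(dplyr)

set.seed(1)
# Toy data standing in for the real 'df' (hypothetical values)
df <- data.frame(var1  = sample(letters[1:3], 600, TRUE),
                 var2  = sample(c("x", "y"), 600, TRUE),
                 value = rnorm(600))

# A unique row identifier makes the train/test split unambiguous
df <- df %>% mutate(row_id = row_number())

train <- df %>%
  group_by(var1, var2) %>%      # strata: every combination of var1 and var2
  slice_sample(prop = 0.7) %>%  # sample 70% of the rows within each stratum
  ungroup()

# Everything not drawn for training, matched explicitly on the identifier
test <- anti_join(df, train, by = "row_id")
```

Joining on the explicit identifier avoids relying on anti_join() matching every shared column, which could drop extra rows if the data contained exact duplicates.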
