I need to split a large dataset into training, validation, and test sets in given proportions, while ensuring the following:
data <- data.frame(IDs = c(001, 001, 001,
002, 002, 002, 002,
003, 003, 003, 003,
004, 004, 004, 004, 004, 004,
005, 005, 005, 005, 005,
006, 006, 006, 006,
007, 007, 007,
008, 008,
009, 009,
010, 010, 010),
var1 = c(0102, 0210, 0405,
0318, 0629, 1201,0101,
0923, 0702, 0710, 0801,
0203, 0501, 1204, 0516, 0112, 1005,
1101, 1125, 1020, 0112, 0310,
0203, 0401, 0607, 0811,
1010, 1212, 0707,
0430, 0428,
1030, 1008,
0501, 0511, 0601),
var2 = c("cold", "cold", "cold",
"warm", "warm", "warm", "warm",
"cold", "cold", "cold", "cold",
"warm", "warm", "warm", "warm", "warm", "warm",
"hot", "hot", "hot", "hot", "hot",
"cold", "cold", "cold", "cold",
"hot", "hot", "hot",
"warm", "warm",
"hot", "hot",
"cold", "cold", "cold"))
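I have tried the data-splitting packages caret (function createDataPartition()) and splitTools (function partition()), as well as the dplyr sampling functions, but the grouping they apply makes every ID appear in all of the sets.
Reducing the dataset is acceptable. Here is one of many attempts, guided by existing Stack Overflow questions: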
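library(dplyr)

# One row per ID, each randomly labelled with a set name
assignments <- data %>%
  select(IDs, var2) %>%
  distinct(IDs) %>%
  rowwise() %>%
  mutate(Group = sample(c("validation", "training", "test"), 1,
                        prob = c(0.70, 0.20, 0.10)))

# Attach the label to every row of the original data
data %>%
  left_join(assignments, by = "IDs")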
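This attempt ignores the probability argument: the proportions are not respected. It also does not ensure that all levels appear in the training, validation, and test sets.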
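A roundabout way, but it should scale. This assigns a group ID across your whole dataset and randomly decides which group each ID goes into. Note that the group lengths will vary, and so will the percentage of each var2 value within each group, but I believe you were expecting that anyway.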
library(dplyr)
set.seed(1)

rnd_grp <- data %>%
  mutate(x = as.factor(IDs)) %>%
  # Group by ID with the factor levels shuffled, so each ID gets a random number
  group_by(factor(x, levels = sample(levels(x)))) %>%
  mutate(x = cur_group_id()) %>%
  ungroup() %>%
  select(IDs:x) %>%  # drops the helper grouping column
  # Number the (var2, ID) combinations, then cycle them through 1, 2, 3
  group_by(var2, x) %>%
  mutate(x = cur_group_id(),
         x = (x %% 3) + 1) %>%
  ungroup() %>%
  arrange(x) %>% # For illustrative purposes only
  mutate(x = case_when(x == 1 ~ "validation",
                       x == 2 ~ "training",
                       x == 3 ~ "test"))
data.frame(rnd_grp)
IDs var1 var2 x
1 2 318 warm validation
2 2 629 warm validation
3 2 1201 warm validation
4 2 101 warm validation
5 7 1010 hot validation
6 7 1212 hot validation
7 7 707 hot validation
8 10 501 cold validation
9 10 511 cold validation
10 10 601 cold validation
11 1 102 cold training
12 1 210 cold training
13 1 405 cold training
14 5 1101 hot training
15 5 1125 hot training
16 5 1020 hot training
17 5 112 hot training
18 5 310 hot training
19 6 203 cold training
20 6 401 cold training
21 6 607 cold training
22 6 811 cold training
23 8 430 warm training
24 8 428 warm training
25 3 923 cold test
26 3 702 cold test
27 3 710 cold test
28 3 801 cold test
29 4 203 warm test
30 4 501 warm test
31 4 1204 warm test
32 4 516 warm test
33 4 112 warm test
34 4 1005 warm test
35 9 1030 hot test
36 9 1008 hot test
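
The approach above cycles IDs evenly through the three sets, so it does not aim for a 70/20/10 split. As a rough sketch only (not part of the answer above), one way to get closer to target proportions while keeping each ID in a single set is to draw the set labels per var2 level. This assumes each ID has exactly one var2 value and every var2 level has at least three IDs. The names toy, split_probs and draw_labels are made up for illustration, and the toy data is randomly generated rather than taken from the question.

```r
library(dplyr)

set.seed(1)

# Toy data (hypothetical): 30 IDs, 10 per var2 level, 2 to 6 rows per ID
ids <- data.frame(IDs = 1:30,
                  var2 = rep(c("cold", "warm", "hot"), each = 10))
toy <- ids[rep(seq_len(nrow(ids)),
               times = sample(2:6, nrow(ids), replace = TRUE)), ]

# Target proportions (assumed values, adjust as needed)
split_probs <- c(training = 0.70, validation = 0.20, test = 0.10)

# Hypothetical helper: returns n set labels. One label per set is forced
# first so every set is represented, the rest are drawn using probs.
draw_labels <- function(n, probs) {
  sets <- names(probs)
  stopifnot(n >= length(sets))
  extra <- sample(sets, n - length(sets), replace = TRUE, prob = probs)
  sample(c(sets, extra))  # shuffle so the forced labels land on random IDs
}

# Assign a set to each ID, separately within each var2 level
assignments <- toy %>%
  distinct(IDs, var2) %>%
  group_by(var2) %>%
  mutate(Group = draw_labels(n(), split_probs)) %>%
  ungroup()

split_data <- toy %>%
  left_join(assignments, by = c("IDs", "var2"))

# Sanity checks: every ID sits in exactly one set,
# and every var2 level shows up in every set
sets_per_id <- split_data %>%
  group_by(IDs) %>%
  summarise(n_sets = n_distinct(Group))
all(sets_per_id$n_sets == 1)
table(split_data$var2, split_data$Group)
```

Because one ID per set is forced within each var2 level, the realised proportions are only approximate, and the deviation is larger when a level has few IDs. The proportions also apply to IDs rather than rows, so row counts per set will drift further if IDs differ a lot in size.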