Split data into training, validation, and test sets with non-overlapping IDs while still balancing the target classes


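I need to split a large dataset into training, validation, and test sets in given proportions, while ensuring the following:

  • IDs stay unique to each set. No ID may appear in more than one set.
  • On every reshuffle of the data, each level ("hot", "warm", "cold") must appear at least once in each of the training, validation, and test sets.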
data <- data.frame(IDs = c(001, 001, 001, 
                           002, 002, 002, 002, 
                           003, 003, 003, 003, 
                           004, 004, 004, 004, 004, 004, 
                           005, 005, 005, 005, 005, 
                           006, 006, 006, 006,
                           007, 007, 007,
                           008, 008, 
                           009, 009, 
                           010, 010, 010),
                   var1 = c(0102, 0210, 0405, 
                            0318, 0629, 1201,0101, 
                            0923, 0702, 0710, 0801,
                            0203, 0501, 1204, 0516, 0112, 1005, 
                            1101, 1125, 1020, 0112, 0310,
                            0203, 0401, 0607, 0811,
                            1010, 1212, 0707,
                            0430, 0428,
                            1030, 1008,
                            0501, 0511, 0601),
                   var2 = c("cold", "cold", "cold", 
                            "warm", "warm", "warm", "warm",
                            "cold", "cold", "cold", "cold", 
                            "warm", "warm", "warm", "warm", "warm", "warm",   
                            "hot", "hot", "hot", "hot", "hot",
                            "cold", "cold", "cold", "cold", 
                            "hot", "hot", "hot",
                            "warm", "warm",
                            "hot", "hot",
                            "cold", "cold", "cold"))

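I tried the data-splitting packages caret (createDataPartition()) and splitTools (partition()), as well as the dplyr sampling functions, but the grouping they apply makes every ID appear in all of the sets.

Reducing the dataset is fine. Below is one of many attempts, guided by existing Stack Overflow questions: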

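library(dplyr)

# Draw one set label per unique ID (note: as written, "validation" gets 0.70)
assignments <- data %>%
        select(IDs, var2) %>%
        distinct(IDs) %>%
        rowwise() %>%
        mutate(Group= sample(c("validation", "training", "test"), 1, 
                             prob = c(0.70, 0.20, 0.10)))
# Attach the set labels to every row of the original data
data %>%
  left_join(assignments, by = "IDs")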

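This attempt appears to ignore the probability argument: the set proportions are not enforced. It also does not guarantee that every level appears in the training, validation, and test sets.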
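The two requirements can be checked mechanically for any candidate split. The following is only a minimal sketch in base R: check_split() and its argument names are hypothetical, and the toy vectors are made up for illustration rather than taken from the data above.

```r
# Hypothetical helper: check a split against the two requirements.
# `ids`, `sets`, and `lvls` are parallel vectors with one entry per row.
check_split <- function(ids, sets, lvls,
                        expected_sets = c("training", "validation", "test")) {
  # Requirement 1: every ID is mapped to exactly one set.
  sets_per_id <- tapply(sets, ids, function(s) length(unique(s)))
  ids_ok <- all(sets_per_id == 1)

  # Requirement 2: every observed level occurs at least once in every
  # expected set (a set that is missing entirely shows up as a row of zeros).
  counts <- table(factor(sets, levels = expected_sets), lvls)
  levels_ok <- all(counts > 0)

  list(ids_ok = ids_ok, levels_ok = levels_ok)
}

# Toy input (made up): ID 1 leaks into two sets, and "warm" only
# appears in the training set, so both checks should fail.
ids  <- c(1, 1, 2, 3, 4, 5, 6)
sets <- c("training", "test", "training", "validation",
          "test", "validation", "training")
lvls <- c("cold", "cold", "hot", "hot", "hot", "cold", "warm")
res <- check_split(ids, sets, lvls)
```

Applied to a joined result such as the one produced by the attempt above, the call would look like check_split(df$IDs, df$Group, df$var2), where df is whatever name the joined data frame is stored under.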

r machine-learning dplyr
1 Answer

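A roundabout way, but it should scale. This assigns a group ID across your entire dataset and randomizes which group each ID ends up in. Note that the group sizes will vary, as will the percentage of each var2 value that appears in each group, but I believe you are expecting that anyway.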

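library(dplyr)
set.seed(1)

rnd_grp <- data %>%
  mutate(x = as.factor(IDs)) %>%
  # Group by ID with the factor levels shuffled, so group ids come out in random order
  group_by(factor(x, levels = sample(levels(x)))) %>%
  mutate(x = cur_group_id()) %>% # x is now a random rank per ID
  ungroup() %>%
  select(IDs:x) %>% # Drop the helper grouping column
  # Number the (var2, x) groups: ids are consecutive within each var2 level
  group_by(var2, x) %>%
  mutate(x = cur_group_id(),
         x = (x %% 3) + 1) %>% # Modulo 3 cycles each level through sets 1 to 3
  ungroup() %>%
  arrange(x) %>% # For illustrative purposes only
  mutate(x = case_when(x == 1 ~ "validation",
                       x == 2 ~ "training",
                       x == 3 ~ "test"))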
  
data.frame(rnd_grp)
   IDs var1 var2          x
1    2  318 warm validation
2    2  629 warm validation
3    2 1201 warm validation
4    2  101 warm validation
5    7 1010  hot validation
6    7 1212  hot validation
7    7  707  hot validation
8   10  501 cold validation
9   10  511 cold validation
10  10  601 cold validation
11   1  102 cold   training
12   1  210 cold   training
13   1  405 cold   training
14   5 1101  hot   training
15   5 1125  hot   training
16   5 1020  hot   training
17   5  112  hot   training
18   5  310  hot   training
19   6  203 cold   training
20   6  401 cold   training
21   6  607 cold   training
22   6  811 cold   training
23   8  430 warm   training
24   8  428 warm   training
25   3  923 cold       test
26   3  702 cold       test
27   3  710 cold       test
28   3  801 cold       test
29   4  203 warm       test
30   4  501 warm       test
31   4 1204 warm       test
32   4  516 warm       test
33   4  112 warm       test
34   4 1005 warm       test
35   9 1030  hot       test
36   9 1008  hot       test
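
If the set sizes should lean toward target proportions while every level still reaches every set, one possible variation is to seed each set with one ID per level and then draw the remaining IDs with the desired probabilities. This is only a sketch under stated assumptions, not part of the answer above: assign_sets() and its arguments are hypothetical, each ID is assumed to carry a single level, and each level is assumed to have at least as many IDs as there are sets.

```r
# Hypothetical helper: assign whole IDs to sets, one level at a time.
assign_sets <- function(ids, lvls,
                        props = c(training = 0.7, validation = 0.2, test = 0.1)) {
  per_id <- unique(data.frame(IDs = ids, lvl = lvls, stringsAsFactors = FALSE))
  stopifnot(!anyDuplicated(per_id$IDs))  # assumption: one level per ID
  set_names <- names(props)
  per_id$set <- NA_character_

  for (l in unique(per_id$lvl)) {
    rows <- which(per_id$lvl == l)
    stopifnot(length(rows) >= length(set_names))  # enough IDs to seed every set
    rows <- rows[sample.int(length(rows))]        # shuffle the IDs of this level

    # Seed: the first shuffled IDs go one to each set,
    # so no set can miss this level.
    seeded <- rows[seq_along(set_names)]
    per_id$set[seeded] <- set_names

    # Remainder: drawn at random with the target probabilities.
    rest <- rows[-seq_along(set_names)]
    per_id$set[rest] <- sample(set_names, length(rest),
                               replace = TRUE, prob = props)
  }
  per_id
}

# One entry per ID, with the level of each ID as in the question's data.
ids  <- 1:10
lvls <- c("cold", "warm", "cold", "warm", "hot",
          "cold", "hot", "warm", "hot", "cold")
set.seed(1)
id_split <- assign_sets(ids, lvls)
```

The resulting table has one row per ID and can be joined back onto the full data by IDs. With only three or four IDs per level, as in this example, the seeding step dominates and the realized proportions stay close to equal thirds; the target probabilities only start to matter when each level has many more IDs than sets.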