Splitting data into training, testing and validation sets based on associated variables for machine learning


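I am trying to split my data into training, testing and validation sets. I have 2 groups, Control and TP, and within these groups I have a secondary variable called Bio, numbered 1-4 in both groups.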

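Within the groups, I need to split by treatment group (Control or TP) and then by Bio as a dependent variable, so that if Control 1 is in the training set, all Control 1 samples are in it, and likewise all TP 1 samples. Although my example data below has the same number of samples in each Bio group (e.g. 3), this is not true of the rest of the data, where different Bio groups contain different numbers of samples.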

Please see the minimal dataset below:

Sample    Treatment Bio  285.945846 286.9638976 288.1004758 288.8109355
Control1_A13   Control   1 0.003535191 0.001777255 0.004729780 0.002364995
Control1_A14   Control   1 0.005063256 0.000110063 0.006249624 0.001041584
Control1_A15   Control   1 0.004262099 0.000836256 0.004277461 0.002699177
Control2_B13   Control   2 0.002411720 0.000466887 0.001129674 0.001109870
Control2_B14   Control   2 0.003085647 0.001831629 0.002482230 0.000000000
Control2_B15   Control   2 0.001996473 0.001060616 0.003995243 0.001369387
Control3_C13   Control   3 0.000299744 0.000851944 0.002808119 0.004065315
Control3_C14   Control   3 0.003187073 0.000591202 0.006833653 0.001713096
Control3_C15   Control   3 0.003692511 0.000262144 0.004673039 0.000126174
Control4_D13   Control   4 0.003369294 0.001087459 0.005171894 0.000675702
Control4_D14   Control   4 0.003818057 0.000838719 0.005513885 0.000458708
Control4_D15   Control   4 0.002572840 0.000257058 0.003537029 0.000009040
LX2+TP1_E1          TP   1 0.003347067 0.001231945 0.008181087 0.004436654
LX2+TP1_E2          TP   1 0.001552547 0.001463769 0.008864838 0.002728083
LX2+TP1_E3          TP   1 0.003224648 0.000812735 0.008518836 0.004303950
LX2+TP2_F1          TP   2 0.001705551 0.000182659 0.000911028 0.000240785
LX2+TP2_F2          TP   2 0.000760944 0.000759464 0.002486596 0.002377735
LX2+TP2_F3          TP   2 0.001034440 0.000647382 0.008146538 0.001028800
LX2+TP3_G1          TP   3 0.003660741 0.001260433 0.008046637 0.003182006
LX2+TP3_G2          TP   3 0.001802459 0.000547580 0.004882082 0.004121552
LX2+TP3_G3          TP   3 0.003590003 0.000089100 0.002801237 0.000403527
LX2+TP4_H1          TP   4 0.002831592 0.001534135 0.009151124 0.003021942
LX2+TP4_H2          TP   4 0.001863099 0.000959953 0.008284829 0.005169246
LX2+TP4_H3          TP   4 0.005649448 0.001959382 0.011814467 0.004110110

I have tried two different approaches:

  • Method 1
library(caret)  # provides createDataPartition()
set.seed(1234)
inTraining <- createDataPartition(vis_data2$Treatment, p=0.6, list=FALSE)
training.set <- vis_data2[inTraining,]
Totalvalidation.set <- vis_data2[-inTraining,]
# This will create another partition of the remaining 40% of the data: 20% testing and 20% validation
inValidation <- createDataPartition(Totalvalidation.set$Treatment, p=0.5, list=FALSE)
testing.set <- Totalvalidation.set[inValidation,]
validation.set <- Totalvalidation.set[-inValidation,]

However, this does not take my second variable, the Bio grouping, into account.

  • Method 2
library(caTools)  # provides sample.split()
set.seed(1)
#Split into training and validation data sets
Y1 = vis_data2[,1] #defining treatment/ variable column 
g1 = vis_data2[,3] #defines group column
final_vis_data <- sample.split(Y1,SplitRatio = 0.5,group = g1)
table(Y1,final_vis_data) #get correct split ratios
split(final_vis_data,g1) #while keeping samples with the same group label together
full_train_set <- vis_data2[ final_vis_data,]
test.set <- vis_data2[!final_vis_data,]

#Split training data set into training and testing data sets
Y2 = full_train_set[,1] #defining treatment/ variable column 
g2 = full_train_set[,3] #defines group column
final_vis_data2 <- sample.split(Y2,SplitRatio = 0.5,group = g2)
table(Y2,final_vis_data2) #get correct split ratios
split(final_vis_data2,g2) #while keeping samples with the same group label together
test.set <- full_train_set[final_vis_data2,1:3]
validation.set <- full_train_set[!final_vis_data2,1:3]

However, when I run this I often get "NA" values in validation.index, and when I check the split, the Bio groups are often not split correctly.

How can I get this to work?

r machine-learning r-caret data-partitioning
1 Answer

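This answer uses functions from rsample and does not use caret's partitioning functions. It will hopefully help you create the initial splits for model fitting.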

To demonstrate splitting the testing data further into a validation set, as you described, I needed to make some additional groups.

set.seed(123)
library(rsample)

# df is the question's data with additional Bio groups added (construction not shown)
df_split <- group_initial_split(df, group = Bio, prop = 0.6)

df_training <- training(df_split)
df_testing <- testing(df_split)

df_validation <- group_validation_split(df_testing, group = Bio, prop = 0.5)

df_analysis <- analysis(df_validation$splits[[1]])
df_assessment <- assessment(df_validation$splits[[1]])

levels(factor(df_training$Bio))
#> [1] "2"  "3"  "6"  "8"  "9"  "10"
levels(factor(df_testing$Bio))
#> [1] "1" "4" "5" "7"
levels(factor(df_analysis$Bio))
#> [1] "1" "5"
levels(factor(df_assessment$Bio))
#> [1] "4" "7"

Created on 2023-08-17 with reprex v2.0.2
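
For comparison, the same idea (assigning whole Bio groups, rather than individual rows, to each set) can be sketched in base R without any extra packages. This is only an illustration under stated assumptions: the data frame toy, its value column and the helper split_by_group() are hypothetical stand-ins, not part of the question's data or of rsample, and the 60/20/20 proportions are taken from the question's Method 1. The final check confirms that no Bio label appears in more than one set.

```r
# Toy stand-in for the question's data: 2 treatments x 10 Bio groups,
# with an uneven number of replicates per Bio group (all values hypothetical).
set.seed(42)
reps <- c(3, 2, 4, 3, 5, 2, 3, 4, 3, 2)   # replicates per Bio group
toy <- data.frame(
  Treatment = rep(c("Control", "TP"), each = sum(reps)),
  Bio       = rep(rep(1:10, times = reps), times = 2)
)
toy$value <- runif(nrow(toy))

# Hypothetical helper: assign whole groups (not rows) to the three sets.
split_by_group <- function(data, group_col,
                           props = c(train = 0.6, test = 0.2)) {
  groups  <- unique(data[[group_col]])
  groups  <- groups[sample.int(length(groups))]   # shuffle the group labels
  n       <- length(groups)
  n_train <- round(props[["train"]] * n)
  n_test  <- round(props[["test"]] * n)
  # Whatever is left after train and test becomes the validation set.
  set_of_group <- rep(c("train", "test", "validation"),
                      times = c(n_train, n_test, n - n_train - n_test))
  names(set_of_group) <- groups
  # Look up each row's set via its group label, then split the rows.
  row_set <- factor(set_of_group[as.character(data[[group_col]])],
                    levels = c("train", "test", "validation"))
  split(data, row_set)
}

sets <- split_by_group(toy, "Bio")

# Leakage check: no Bio label may appear in more than one set.
bio_in <- lapply(sets, function(d) unique(d$Bio))
stopifnot(
  length(intersect(bio_in$train, bio_in$test)) == 0,
  length(intersect(bio_in$train, bio_in$validation)) == 0,
  length(intersect(bio_in$test, bio_in$validation)) == 0
)
sapply(bio_in, length)   # Bio groups per set: 6, 2 and 2 for this toy data
```

Because the split is made on Bio labels, Control and TP samples with the same Bio number always land in the same set, and the row counts per set can deviate from 60/20/20 when groups have different sizes. With only four Bio groups, as in the question's minimal dataset, a 60/20/20 split of groups cannot be exact.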
