R data partitioning based on two variables (character and numeric)

Question

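I want to partition the following dataset (dataGenotype) based on two variables, Genotype and stand_ID. For example, for genotype H13, stand_ID 7 could go to training while stand_ID 18 and 21 go to testing.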

Genotype  stand_ID  Inventory_date  stemC       mheight
H13       7         5/18/2006       1940.1075   11.33995
H13       7         11/1/2008       10898.9597  23.20395
H13       7         4/14/2009       12830.1284  23.77395
H13       18        11/3/2005       2726.42     13.4432
H13       18        6/30/2008       12226.1554  24.091967
H13       18        4/14/2009       14141.68    25.0922
H13       21        5/18/2006       4981.7158   15.7173
H13       21        4/14/2009       20327.0667  27.9155
H15       9         3/31/2006       3570.06     14.7898
H15       9         11/1/2008       15138.8383  26.2088
H15       9         4/14/2009       17035.4688  26.8778
H15       20        1/18/2005       3016.881    14.1886
H15       20        10/4/2006       8330.4688   20.19425
H15       20        6/30/2008       13576.5     25.4774
U21       3         1/9/2006        3660.416    15.09925
U21       3         6/30/2008       13236.29    24.27634
U21       3         4/14/2009       16124.192   25.79562
U21       67        11/4/2005       2812.8425   13.60485
U21       67        4/14/2009       13468.455   24.6203

The desired output is as follows:

A - Training

Genotype  stand_ID  Inventory_date  stemC       mheight
H13       7         5/18/2006       1940.1075   11.33995
H13       7         11/1/2008       10898.9597  23.20395
H13       7         4/14/2009       12830.1284  23.77395
H15       9         3/31/2006       3570.06     14.7898
H15       9         11/1/2008       15138.8383  26.2088
H15       9         4/14/2009       17035.4688  26.8778
U21       67        11/4/2005       2812.8425   13.60485
U21       67        4/14/2009       13468.455   24.6203

B - Testing

Genotype  stand_ID  Inventory_date  stemC       mheight
H13       18        11/3/2005       2726.42     13.4432
H13       18        6/30/2008       12226.1554  24.091967
H13       18        4/14/2009       14141.68    25.0922
H13       21        5/18/2006       4981.7158   15.7173
H13       21        4/14/2009       20327.0667  27.9155
H15       20        1/18/2005       3016.881    14.1886
H15       20        10/4/2006       8330.4688   20.19425
H15       20        6/30/2008       13576.5     25.4774
U21       3         1/9/2006        3660.416    15.09925
U21       3         6/30/2008       13236.29    24.27634
U21       3         4/14/2009       16124.192   25.79562

I tried the following code:

library(caret)
clonePartitioning <- createDataPartition(dataGenotype$stand_ID,1,list=F,p=0.2)
train = dataGenotype[clonePartitioning,]
test = dataGenotype[-clonePartitioning,]

I also tried:

createDataPartition(unique(dataGenotype$stand_ID),1,list=F,p=0.2)

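Neither produced the desired output: the data were split within a stand_ID. For example, one row of stand_ID 7 went to training and the other two rows of stand_ID 7 went to testing. How can I partition the data by stand_ID within each Genotype, so that all rows of a stand end up in the same set?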

Answer 1

Here is one approach using dplyr:

library(tidyverse)
set.seed(1) # for reproducibility of the split
df %>%
  group_by(Genotype) %>% # group data by Genotype
  distinct(stand_ID) %>% # keep the unique stand_IDs within each Genotype
  sample_frac(.2) %>% # sample these stand_IDs with a fraction of your choice
  mutate(data = "test") %>% # label the sampled stands as test
  right_join(df, by = c("Genotype", "stand_ID")) %>% # join back to the original data (assumes the result keeps df's row order); train rows get NA
  pull(data) %>% # pull the vector of "test"/NA labels
  is.na -> train_ind # TRUE where the label is NA, i.e. the train rows

df[train_ind,]
   Genotype stand_ID Inventory_date     stemC  mheight
4       H13       18      11/3/2005  2726.420 13.44320
5       H13       18      6/30/2008 12226.155 24.09197
6       H13       18      4/14/2009 14141.680 25.09220
7       H13       21      5/18/2006  4981.716 15.71730
8       H13       21      4/14/2009 20327.067 27.91550
9       H15        9      3/31/2006  3570.060 14.78980
10      H15        9      11/1/2008 15138.838 26.20880
11      H15        9      4/14/2009 17035.469 26.87780
15      H15       32       2/1/2006  3426.253 14.31815
16      U21        3       1/9/2006  3660.416 15.09925
17      U21        3      6/30/2008 13236.290 24.27634
18      U21        3      4/14/2009 16124.192 25.79562
19      U21       67      11/4/2005  2812.843 13.60485
20      U21       67      4/14/2009 13468.455 24.62030

df[!train_ind,]
   Genotype stand_ID Inventory_date     stemC  mheight
1       H13        7      5/18/2006  1940.108 11.33995
2       H13        7      11/1/2008 10898.960 23.20395
3       H13        7      4/14/2009 12830.128 23.77395
12      H15       20      1/18/2005  3016.881 14.18860
13      H15       20      10/4/2006  8330.469 20.19425
14      H15       20      6/30/2008 13576.500 25.47740
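
The same idea can be sketched in base R, without relying on join behavior. This is only an illustration on a small subset of the question's rows, not the answer's original method. The names test_frac and pick_test are hypothetical, and forcing at least one test stand per genotype is an added assumption: with sample_frac(.2), a genotype that has only two stands can end up with no test stand at all, as U21 does in the output above.

```r
# Small subset of the question's data (values copied from the table in the question).
df <- data.frame(
  Genotype = c("H13", "H13", "H13", "H13", "H15", "H15", "U21", "U21"),
  stand_ID = c(7, 7, 18, 21, 9, 20, 3, 67),
  stemC    = c(1940.1075, 10898.9597, 2726.42, 4981.7158,
               3570.06, 3016.881, 3660.416, 2812.8425),
  stringsAsFactors = FALSE
)

set.seed(1)  # for a reproducible split

# One row per (Genotype, stand_ID) pair: these are the units being sampled.
stands <- unique(df[, c("Genotype", "stand_ID")])

# Within each genotype, draw a fraction of the stands for the test set,
# but never fewer than one (assumption; test_frac is a hypothetical name).
test_frac <- 0.2
pick_test <- function(ids) {
  n_test <- max(1, round(length(ids) * test_frac))
  ids %in% ids[sample.int(length(ids), n_test)]
}
# ave() applies pick_test per genotype and returns 0/1, hence the comparison.
stands$is_test <- ave(stands$stand_ID, stands$Genotype, FUN = pick_test) == 1

# Map the stand-level label back onto every row through a combined key,
# so whole stands move together and the row order of df is untouched.
key_all  <- paste(df$Genotype, df$stand_ID)
key_test <- paste(stands$Genotype, stands$stand_ID)[stands$is_test]
test  <- df[key_all %in% key_test, ]
train <- df[!key_all %in% key_test, ]

# Sanity check: no (Genotype, stand_ID) pair appears in both sets.
overlap <- intersect(paste(train$Genotype, train$stand_ID),
                     paste(test$Genotype, test$stand_ID))
stopifnot(length(overlap) == 0)
```

Because the sampling is done on stands rather than rows, every row of a given stand_ID lands in exactly one of train or test. Note that a genotype with a single stand would have that stand placed in test under this sketch, so the max(1, ...) floor may need adjusting for such data.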
