How can I sample from another dataset based on factors?


I have two datasets. `df_sample` contains measurements of `param` across sites and quadrats (with replicates). I want to use this dataset to populate `df`.

set.seed(111)

#This is the dataset I want to draw the sample from
site <- rep(c("1","2","3"), each = 20)
quad <- rep(c("1","2","3","4","5"), times = 12)
param <- rnorm(60,5,1)

df_sample <- data.frame(site,quad, param)


#This is the dataset I want to add the sampling to
month <- rep(c("J","J","J","F","M"), each = 5)
site <- rep(c("1","2","3","1","2"), each = 5)
quad <- rep(c("1","2","3","4","5"), times = 5)

df <- data.frame(month,site,quad)

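Note that the first dataset contains only sites where the parameter was measured several times in each quadrat. Now, in `df`, I want to create a new column `param`. For each month and site, this parameter should be sampled at random only from the matching site and quadrat. So basically, each site and quadrat can take one of three values. How can I do this?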

df$param <- sample(df_sample$param)

As an example `df` could look like this

month site quad param
  J    1     1    4.8236
  J    1     2    3.502
  J    1     3    6.84
 ...
r loops for-loop dplyr sample
3 Answers

1 vote

Add an ID column to `df_sample` that labels each replicate as 1, 2, or 3. (I call it `run`.) Add a corresponding, randomly chosen `run` column to `df`. Then join:

library(dplyr)
df_sample = df_sample %>% 
  group_by(site, quad) %>%
  mutate(run = row_number()) %>%
  ungroup()

df %>%
  mutate(run = sample(1:3, size = n(), replace = TRUE)) %>%
  left_join(df_sample)
# Joining, by = c("site", "quad", "run")
#    month site quad run    param
# 1      J    1    1   2 5.140278
# 2      J    1    2   2 3.502573
# 3      J    1    3   1 4.688376
# 4      J    1    4   1 2.697654
# 5      J    1    5   3 5.797529
# 6      J    2    1   2 5.598254
# 7      J    2    2   2 3.158466
# 8      J    2    3   2 7.718056
# 9      J    2    4   3 3.379530
# 10     J    2    5   3 2.734004
# 11     J    3    1   1 3.824274
# 12     J    3    2   2 5.331380
# 13     J    3    3   3 5.914242
# 14     J    3    4   3 5.358625
# 15     J    3    5   3 5.175096
# 16     F    1    1   2 5.140278
# 17     F    1    2   1 4.669264
# 18     F    1    3   3 6.845636
# 19     F    1    4   1 2.697654
# 20     F    1    5   2 4.506038
# 21     M    2    1   3 1.886783
# 22     M    2    2   1 5.346964
# 23     M    2    3   2 7.718056
# 24     M    2    4   2 5.191244
# 25     M    2    5   2 3.698704
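
The `sample(1:3, ...)` call above hard-codes three replicates. With the example data as constructed (60 rows over 3 sites and 5 quadrats), each site/quad pair has four rows, so replicate 4 is never drawn. Below is a minimal sketch of one way to avoid the hard-coded range, assuming every site/quad pair in `df` also occurs in `df_sample`. The names `df_sample_run`, `n_reps`, `n_rep` and `result` are illustrative and not part of the original answer.

```r
library(dplyr)

set.seed(111)
# Example data rebuilt from the question
df_sample <- data.frame(
  site  = rep(c("1", "2", "3"), each = 20),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 12),
  param = rnorm(60, 5, 1)
)
df <- data.frame(
  month = rep(c("J", "J", "J", "F", "M"), each = 5),
  site  = rep(c("1", "2", "3", "1", "2"), each = 5),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 5)
)

# Number the replicates within each site/quad pair
df_sample_run <- df_sample %>%
  group_by(site, quad) %>%
  mutate(run = row_number()) %>%
  ungroup()

# How many replicates each site/quad pair actually has
n_reps <- count(df_sample, site, quad, name = "n_rep")

result <- df %>%
  left_join(n_reps, by = c("site", "quad")) %>%
  # Draw one replicate index per row from 1..n_rep
  # (assumption: no NA in n_rep, i.e. every pair in df was matched)
  mutate(run = vapply(n_rep, function(k) sample.int(k, 1L), integer(1))) %>%
  select(-n_rep) %>%
  left_join(df_sample_run, by = c("site", "quad", "run"))
```

Because the replicate index is drawn separately for each row of `df`, the same site/quad pair can receive different values in different months.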

1 vote

With `data.table`:

library(data.table)
set.seed(1)
# Draw one param per (site, quad) pair, then join the result onto df
setDT(df_sample)[,
           list(param = sample(param,1)),by=list(site, quad)][
           setDT(df), on = c("site","quad")]
 #   site quad    param month
 #1:    1    1 5.235221     J
 #2:    1    2 4.914149     J
 #3:    1    3 6.845636     J
 #4:    1    4 2.697654     J
 #5:    1    5 4.506038     J
 #6:    2    1 5.361662     J
 #7:    2    2 4.058643     J
 #8:    2    3 6.400259     J
 ...
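
The join above draws a single value per site/quad pair before joining, so rows of `df` that repeat a pair in different months (site 1 in J and F, for example) receive the same `param`. If an independent draw per row of `df` is wanted, one possible sketch uses `by = .EACHI`, which evaluates the `j` expression once for each row of the table passed as `i`. It assumes every row of `df` has at least one match in `df_sample`. The names `dt_sample`, `dt` and `res` are illustrative.

```r
library(data.table)

set.seed(111)
# Example data rebuilt from the question
dt_sample <- data.table(
  site  = rep(c("1", "2", "3"), each = 20),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 12),
  param = rnorm(60, 5, 1)
)
dt <- data.table(
  month = rep(c("J", "J", "J", "F", "M"), each = 5),
  site  = rep(c("1", "2", "3", "1", "2"), each = 5),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 5)
)

set.seed(1)
# For each row of dt, param holds the matching replicates from dt_sample
# and .N is their count; pick one of them by index.
# Assumption: .N >= 1 for every row (no unmatched site/quad pairs).
res <- dt_sample[
  dt,
  .(month = i.month, param = param[sample.int(.N, 1L)]),
  on = c("site", "quad"),
  by = .EACHI
]
```

Indexing with `sample.int(.N, 1L)` rather than calling `sample(param, 1)` sidesteps the base R behaviour where `sample(x, 1)` on a single number `x >= 1` samples from `1:x`.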

1 vote

You can use a join with `multiple = "any"`:

library(dplyr) #1.1.0 or above required
df %>% 
  left_join(df_sample, multiple = "any")

Alternatively, use `multiple = "all"` + `slice_sample`. This is probably the safer option, because the documentation does not say that `multiple = "any"` samples at random:

df %>% 
  left_join(df_sample, multiple = "all") %>% 
  slice_sample(by = c(month, site, quad))

Output

Joining with `by = join_by(site, quad)`
   month site quad    param
1      J    1    1 5.235221
2      J    1    2 4.669264
3      J    1    3 4.688376
4      J    1    4 2.697654
5      J    1    5 4.829124
6      J    2    1 5.361662
7      J    2    2 5.346964
8      J    2    3 5.189737
9      J    2    4 4.840423
10     J    2    5 5.326549
11     J    3    1 3.824274
12     J    3    2 3.878784
13     J    3    3 3.638096
14     J    3    4 5.481125
15     J    3    5 5.741972
16     F    1    1 5.235221
17     F    1    2 4.669264
18     F    1    3 4.688376
19     F    1    4 2.697654
20     F    1    5 4.829124
21     M    2    1 5.361662
22     M    2    2 5.346964
23     M    2    3 5.189737
24     M    2    4 4.840423
25     M    2    5 5.326549