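I have two datasets. `df_sample` contains some measurements, `param`, across sites and quadrants (with replicates). I want to use this dataset to populate `df`.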
set.seed(111)
#This is the dataset I want to draw the sample from
site <- rep(c("1","2","3"), each = 20)
quad <- rep(c("1","2","3","4","5"), times = 12)
param <- rnorm(60,5,1)
df_sample <- data.frame(site,quad, param)
#This is the dataset I want to add the sampling to
month <- rep(c("J","J","J","F","M"), each = 5)
site <- rep(c("1","2","3","1","2"), each = 5)
quad <- rep(c("1","2","3","4","5"), times = 5)
df <- data.frame(month,site,quad)
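Note that the first dataset only contains sites where the parameter was measured multiple times in each quadrant. Now, in `df`, I want to create a new column `param`. For each month and site, this parameter should be randomly sampled only from its corresponding site and quadrant. So basically, each site and quadrant can take one of three values. How can I do this?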
# Note: this ignores site/quad and errors, since it supplies 60 values for the 25 rows of df
df$param <- sample(df_sample$param)
As an example, `df` could look like this:
month site quad param
J 1 1 4.8236
J 1 2 3.502
J 1 3 6.84
...
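Before sampling, it may help to confirm how many replicates each site/quadrant pair actually has. The sketch below rebuilds `df_sample` as defined in the question and counts rows per pair. With this construction (20 rows per site spread over 5 quadrants) the count comes out as four per pair rather than three. `n_rep` is a name introduced here purely for illustration.

```r
set.seed(111)
site  <- rep(c("1", "2", "3"), each = 20)
quad  <- rep(c("1", "2", "3", "4", "5"), times = 12)
param <- rnorm(60, 5, 1)
df_sample <- data.frame(site, quad, param)

# Count replicates for every site/quad combination
# (n_rep is a hypothetical helper name, not part of the question)
n_rep <- table(site = df_sample$site, quad = df_sample$quad)
print(n_rep)
```

Any approach that hard-codes the number of replicates should be checked against this count.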
Add an ID column to `df_sample` that labels each replicate as 1, 2, or 3 (I call it `run`). Add a corresponding, randomly chosen `run` column to `df`. Then join:
library(dplyr)
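df_sample <- df_sample %>%
  group_by(site, quad) %>%
  mutate(run = row_number()) %>%   # replicate index within each site/quad
  ungroup()
# Note: sample(1:3, ...) assumes three replicates per site/quad. The example
# data as generated has four (20 rows per site / 5 quads), so use 1:4 to
# draw from all of them.
df %>%
  mutate(run = sample(1:3, size = n(), replace = TRUE)) %>%
  left_join(df_sample)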
# Joining, by = c("site", "quad", "run")
# month site quad run param
# 1 J 1 1 2 5.140278
# 2 J 1 2 2 3.502573
# 3 J 1 3 1 4.688376
# 4 J 1 4 1 2.697654
# 5 J 1 5 3 5.797529
# 6 J 2 1 2 5.598254
# 7 J 2 2 2 3.158466
# 8 J 2 3 2 7.718056
# 9 J 2 4 3 3.379530
# 10 J 2 5 3 2.734004
# 11 J 3 1 1 3.824274
# 12 J 3 2 2 5.331380
# 13 J 3 3 3 5.914242
# 14 J 3 4 3 5.358625
# 15 J 3 5 3 5.175096
# 16 F 1 1 2 5.140278
# 17 F 1 2 1 4.669264
# 18 F 1 3 3 6.845636
# 19 F 1 4 1 2.697654
# 20 F 1 5 2 4.506038
# 21 M 2 1 3 1.886783
# 22 M 2 2 1 5.346964
# 23 M 2 3 2 7.718056
# 24 M 2 4 2 5.191244
# 25 M 2 5 2 3.698704
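If the number of replicates is not the same for every site/quadrant pair, the `run` idea can be adapted so that each row draws its index from that pair's own count. This is only a sketch under that assumption, not a replacement for the answer above. The names `runs`, `n_runs`, `n_per_group` and `result` are introduced here for illustration.

```r
library(dplyr)

set.seed(111)
df_sample <- data.frame(
  site  = rep(c("1", "2", "3"), each = 20),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 12),
  param = rnorm(60, 5, 1)
)
df <- data.frame(
  month = rep(c("J", "J", "J", "F", "M"), each = 5),
  site  = rep(c("1", "2", "3", "1", "2"), each = 5),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 5)
)

# Index the replicates and record how many each site/quad has
runs <- df_sample %>%
  group_by(site, quad) %>%
  mutate(run = row_number(), n_runs = n()) %>%
  ungroup()

# One row per site/quad with its replicate count (hypothetical helper table)
n_per_group <- distinct(runs, site, quad, n_runs)

result <- df %>%
  left_join(n_per_group, by = c("site", "quad")) %>%
  # Independent uniform integer in 1..n_runs for every row of df
  mutate(run = as.integer(ceiling(runif(n()) * n_runs))) %>%
  left_join(select(runs, site, quad, run, param),
            by = c("site", "quad", "run")) %>%
  select(month, site, quad, param)
```

Because the index is drawn per row of `df`, the same site/quadrant pair can receive different values in different months.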
With `data.table`:
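library(data.table)
set.seed(1)
# Draw one param per site/quad, then join onto df. The draw happens once per
# site/quad, so the same value is reused across months.
setDT(df_sample)[,
  list(param = sample(param, 1)), by = list(site, quad)][
  setDT(df), on = c("site", "quad")]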
# site quad param month
#1: 1 1 5.235221 J
#2: 1 2 4.914149 J
#3: 1 3 6.845636 J
#4: 1 4 2.697654 J
#5: 1 5 4.506038 J
#6: 2 1 5.361662 J
#7: 2 2 4.058643 J
#8: 2 3 6.400259 J
...
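The snippet above samples once per site/quadrant pair, so a pair that appears in several months gets the same value each time. If an independent draw per month is wanted, one possible variation is to join first and sample afterwards. This is a sketch assuming the question's data. `expanded` and `result` are illustrative names, and `allow.cartesian = TRUE` is set because every row of `df` matches several rows of `df_sample`.

```r
library(data.table)

set.seed(111)
df_sample <- data.table(
  site  = rep(c("1", "2", "3"), each = 20),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 12),
  param = rnorm(60, 5, 1)
)
df <- data.table(
  month = rep(c("J", "J", "J", "F", "M"), each = 5),
  site  = rep(c("1", "2", "3", "1", "2"), each = 5),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 5)
)

# Expand every df row to all of its matching replicates
expanded <- df_sample[df, on = c("site", "quad"), allow.cartesian = TRUE]

# Keep one replicate at random per month/site/quad.
# Indexing with sample.int() avoids sample()'s special case for length-1 input.
result <- expanded[, .(param = param[sample.int(.N, 1L)]),
                   by = .(month, site, quad)]
```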
You can use a join with `multiple = "any"`:
library(dplyr) #1.1.0 or above required
df %>%
  left_join(df_sample, multiple = "any")
Or use `multiple = "all"` + `slice_sample`. Since the documentation does not say that `multiple = "any"` samples at random, this option is probably safer:
df %>%
  left_join(df_sample, multiple = "all") %>%
  slice_sample(by = c(month, site, quad))
Output
Joining with `by = join_by(site, quad)`
month site quad param
1 J 1 1 5.235221
2 J 1 2 4.669264
3 J 1 3 4.688376
4 J 1 4 2.697654
5 J 1 5 4.829124
6 J 2 1 5.361662
7 J 2 2 5.346964
8 J 2 3 5.189737
9 J 2 4 4.840423
10 J 2 5 5.326549
11 J 3 1 3.824274
12 J 3 2 3.878784
13 J 3 3 3.638096
14 J 3 4 5.481125
15 J 3 5 5.741972
16 F 1 1 5.235221
17 F 1 2 4.669264
18 F 1 3 4.688376
19 F 1 4 2.697654
20 F 1 5 4.829124
21 M 2 1 5.361662
22 M 2 2 5.346964
23 M 2 3 5.189737
24 M 2 4 4.840423
25 M 2 5 5.326549
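For comparison, the same per-row sampling can be sketched in base R without any packages. This is an illustrative alternative rather than part of the answers above. It assumes the question's data, and `pool` and `key` are names made up for this example. The final `stopifnot()` checks that every drawn value really comes from its own site/quadrant pair.

```r
set.seed(111)
df_sample <- data.frame(
  site  = rep(c("1", "2", "3"), each = 20),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 12),
  param = rnorm(60, 5, 1)
)
df <- data.frame(
  month = rep(c("J", "J", "J", "F", "M"), each = 5),
  site  = rep(c("1", "2", "3", "1", "2"), each = 5),
  quad  = rep(c("1", "2", "3", "4", "5"), times = 5)
)

# Candidate values for each site/quad, keyed by "site_quad"
pool <- split(df_sample$param, paste(df_sample$site, df_sample$quad, sep = "_"))
key  <- paste(df$site, df$quad, sep = "_")

# Draw one value per row of df from the matching pool
df$param <- vapply(pool[key], function(v) v[sample.int(length(v), 1L)],
                   numeric(1), USE.NAMES = FALSE)

# Sanity check: each sampled value belongs to its own site/quad pool
stopifnot(all(mapply(function(p, k) p %in% pool[[k]], df$param, key)))
```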