Sampling with weights and with replacement using sample_n()

Question

All,

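I have a dplyr sample_n() question. I'm trying to sample with replacement while using the weight option, and I seem to be hitting a snag. Namely, sampling with replacement consistently oversamples one group. The problem doesn't appear when sampling without replacement, but I'd really like to sample with replacement if I can.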

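Here's a minimal working example that uses the familiar apipop and apistrat data from the survey package. Survey researchers in R know these data well. In the population data (apipop), elementary schools (stype == E) account for about 71.4% of all schools, middle schools (stype == M) about 16.4%, and high schools (stype == H) about 12.2%. apistrat has a deliberate imbalance: elementary schools are 50% of the data, while middle and high schools are each 25% of the 200-row sample.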

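What I want to do is sample the apistrat data with the sample_n() function, with replacement. However, I seem to consistently oversample elementary schools and undersample middle and high schools. Here's a minimal working example in R code. Please forgive my cornball loop code. I know I need to get better at purrr, but I'm not there yet. :P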

library(survey)
library(tidyverse)

# Load apistrat and apipop, which ship in the survey package's api data set
data(api)

apistrat %>% as_tibble() -> strat
apipop %>% as_tibble() -> pop

pop %>%
  group_by(stype) %>% 
  summarize(prop = n()/6194) -> Census

Census
# p(E) = ~.714
# p(H) = ~.122
# p(M) = ~.164

strat %>%
  left_join(., Census) -> strat

# Sampling with replacement seems to consistently oversample E and undersample H and M.
with_replace <- tibble()
set.seed(8675309) # Jenny, I got your number...

for (i in 1:1000) {
  strat %>%
    sample_n(100, replace=T, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)

}

# group_by means with 95% intervals
with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))

# ^ consistently oversampled E.
# meanprop of E = ~.835.
# meanprop of H = ~.070 and meanprop of M = ~.095
# 95% intervals don't include true probability for either E, H, or M.

# Sampling without replacement doesn't seem to have this same kind of sampling problem.
wo_replace <- tibble()
set.seed(8675309)  # Jenny, I got your number...

for (i in 1:1000) {
  strat %>%
    sample_n(100, replace=F, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  wo_replace <- bind_rows(wo_replace, hold_this)

}

# group_by means with 95% intervals
wo_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))


# ^ better, in orbit of the true probability
# meanprop of E = ~.757. meanprop of H = ~.106. meanprop of M = ~.137
# 95% intervals include true probability as well.

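I'm not sure whether this is a dplyr (v0.8.3) issue. The 95% intervals from sampling with replacement don't include the true probabilities, and every sample (to take a peek) consistently puts elementary schools in the mid-80s percent range. Only 3 of the 1,000 samples (with replacement) have elementary schools at less than 72% of the 100-row sample. That's how consistent it is. I'm curious whether anyone here has insight into what's happening, what I might be doing wrong, or whether I'm misunderstanding what sample_n() does.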

Thanks in advance.

Answer 1

sample_n() in dplyr is a wrapper around base::sample.int(). Looking at base::sample.int(), the actual work is implemented in C. We can see that the problem shows up at the source:

rows <- sample(nrow(strat), size = 100, replace=F, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
  stype     n
  <fct> <int>
1 E        74
2 H        14
3 M        12

rows <- sample(nrow(strat), size = 100, replace=T, prob = strat$prop)
strat[rows, ] %>% count(stype)
# A tibble: 3 x 2
  stype     n
  <fct> <int>
1 E        85
2 H         8
3 M         7

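Honestly, I'm not entirely sure why this happens, but if you make the probabilities sum to 1 and make them uniform within each group, you get the expected sample sizes: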

library(tidyverse)
library(survey)

data(api)

apistrat %>% tbl_df() -> strat
apipop %>% tbl_df() -> pop

pop %>%
  group_by(stype) %>% 
  summarize(prop = n()/6194) -> Census


strat %>%
  left_join(., Census) -> strat
#> Joining, by = "stype"

set.seed(8675309) # Jenny, I got your number...
with_replace <- tibble()

for (i in 1:1000) {
  strat %>%
    group_by(stype) %>%
    mutate(per_prob = sample(prop/n())) %>% 
    ungroup() %>% 
    sample_n(100, replace=T, weight = per_prob) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  with_replace <- bind_rows(with_replace, hold_this)

}

with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))
#> # A tibble: 3 x 4
#>   stype meanprop   lwr   upr
#>   <fct>    <dbl> <dbl> <dbl>
#> 1 E        0.713  0.63  0.79
#> 2 H        0.123  0.06  0.19
#> 3 M        0.164  0.09  0.24

Created on 2020-04-17 by the reprex package (v0.3.0)
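
For anyone who wants to check the per-row rescaling without dplyr or the survey package, here is a minimal base R sketch. It assumes only what the question states: apistrat has 100 E rows and 50 rows each of H and M, and the population shares are roughly .714, .122, and .164. The names stype, pop_share, n_in_group, per_prob, and shares are illustrative stand-ins, not objects from the reprex above.

```r
set.seed(8675309)

# One label per row, mirroring the 100/50/50 split described in the question.
stype <- rep(c("E", "H", "M"), times = c(100, 50, 50))

# Approximate population shares quoted in the question (assumed, rounded).
pop_share <- c(E = 0.714, H = 0.122, M = 0.164)

# Number of rows in each row's own group, repeated per row.
n_in_group <- ave(rep(1, length(stype)), stype, FUN = sum)

# Per-row weight: the group's share divided by the group's row count.
per_prob <- unname(pop_share[stype]) / n_in_group

# Draw 1,000 samples of 100 rows with replacement; record each group's share.
shares <- replicate(1000, {
  rows <- sample.int(length(stype), size = 100, replace = TRUE, prob = per_prob)
  as.numeric(table(factor(stype[rows], levels = names(pop_share)))) / 100
})
rownames(shares) <- names(pop_share)

# Mean share per group across the 1,000 samples.
rowMeans(shares)
```

The row means should land close to the assumed population shares, which is the same pattern as the meanprop column in the tibble above.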

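My guess is that it has something to do with the entries in the p vector not being reduced when replace = TRUE, but I really don't know what's going on under the hood. Someone who knows C should take a look!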
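
One possible reading, offered as a sketch and not as a confirmed diagnosis of the C code: the weights are applied per row, so a group's total weight is its share multiplied by its number of rows. Under that assumption, the arithmetic below reproduces the proportions reported in the question for sampling with replacement. The group sizes and shares are the approximate figures quoted in the question, and the variable names are illustrative.

```r
# Rows per group in apistrat and approximate population shares (from the question).
n_rows <- c(E = 100, H = 50, M = 50)
pop_share <- c(E = 0.714, H = 0.122, M = 0.164)

# Every row carries its group's share as its weight, so the group's
# total weight is (rows in group) * (share).
group_weight <- n_rows * pop_share

# With replacement, each draw lands in a group with probability
# proportional to that total weight.
expected_naive <- group_weight / sum(group_weight)
round(expected_naive, 3)

# Dividing each row's weight by its group size makes the group totals
# equal the shares again, which is what the rescaling in the loop does.
fixed_weight <- n_rows * (pop_share / n_rows)
expected_fixed <- fixed_weight / sum(fixed_weight)
round(expected_fixed, 3)
```

With these inputs, expected_naive comes out at roughly .833 for E, .071 for H, and .096 for M, in line with the ~.835, ~.070, and ~.095 observed in the question. If this reading is right, sampling without replacement would drift back toward the population shares only because the heavily weighted E rows get used up as the draw proceeds.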
