Stratified sampling against two population benchmark distributions

Question

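I cannot find an existing method, or work out how to write new code, for stratified sampling from a data set (framework) against two different population benchmark distributions. Since I am not 100% sure that I am using the right terminology, I will explain more concretely with a simplified example: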

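I have a data set of panel members whose gender and education level I know; this is the sampling frame, framework. I want to draw a sample from it using stratified sampling. I know the population distribution of gender and the population distribution of education, but not their joint distribution (and I am not willing to assume that education is distributed identically across genders). Using stratified sampling, I want to end up with a sample that is (approximately) representative on both benchmarks.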

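The code below shows how I sample against one distribution (gender). I know that the sampling package has functions that simplify stratified sampling, but as far as I understand they do not support stratifying on two marginal distributions.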

library(dplyr)

N = 1000 # framework size
n = 300 # sample size

# create sampling framework
framework = data.frame(id        = seq_len(N),
                       gender    = sample(c("M","F"), N, replace = TRUE, prob = c(0.3, 0.7)),
                       education = sample(c("1. Low", "2. Mid", "3. High"), N, replace = TRUE, prob = c(0.2, 0.3, 0.5)))

# create population benchmarks
pop_gender    = data.frame(gender    = c("M", "F"),
                           prop      = c(0.5, 0.5))
pop_education = data.frame(education = c("1. Low", "2. Mid", "3. High"),
                           prop      = c(0.4, 0.3, 0.3))

# loop through strata (in this case just M/F) and select sample
selected = c() # empty selection vector (starting from NA would leave an NA in the ids)
for(i in pop_gender$gender){
  # subset framework to stratum
  framework_sel = framework %>%
    filter(gender == i) 
  
  # select sample from stratum
  selected_i = sample(framework_sel$id, # sample from ids
                      n*pop_gender$prop[pop_gender$gender == i], # sample size within stratum
                      replace = FALSE)
  selected = c(selected, selected_i)
}

# pull sample from framework
# (named sampled so that it does not mask base::sample)
sampled = framework %>%
  filter(id %in% selected)

# compare sample to population
prop.table(table(sampled$gender))
prop.table(table(sampled$education))

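To be clear: I want to end up with a sample that matches the population on both gender and education.

Any insights are much appreciated!

One further issue, which I did not include in this simplified example, is that the frame will quite likely not have enough people in some strata to reach the intended stratum sample size.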
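The shortfall described above can be detected before any sampling is done, by comparing the size requested from each stratum with the number of people the frame actually holds. The following is a minimal sketch for the gender-only case. It rebuilds a frame like the one in the code above so that it runs on its own; the seed and the names frame and check are illustrative assumptions, not part of the original code.

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

N <- 1000  # frame size
n <- 300   # sample size

# hypothetical frame, built the same way as in the question
frame <- data.frame(
  id     = seq_len(N),
  gender = sample(c("M", "F"), N, replace = TRUE, prob = c(0.3, 0.7))
)
pop_gender <- data.frame(gender = c("M", "F"), prop = c(0.5, 0.5))

# people available per stratum, looked up by name so the order
# follows pop_gender rather than the alphabetical order of table()
available <- as.integer(table(frame$gender)[pop_gender$gender])

# intended stratum sample sizes
wanted <- round(n * pop_gender$prop)

check <- data.frame(
  gender    = pop_gender$gender,
  wanted    = wanted,
  available = available,
  feasible  = wanted <= available
)
check
```

Rows where feasible is FALSE mark strata in which the intended size cannot be reached without sampling with replacement, lowering n, or accepting a smaller stratum sample.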

Answer 1

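The code below samples from framework, stratified by gender and education.
The resulting sample does not exactly "match" the population on gender and education. The code tries to match the proportions in pop_gender and pop_education, but the final result is still subject to randomness and is not exactly what is wanted.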

N <- 1000 # framework size
n <- 300 # sample size

# create population benchmarks
pop_gender    = data.frame(gender    = c("M", "F"),
                           prop      = c(0.5, 0.5))
pop_education = data.frame(education = c("1. Low", "2. Mid", "3. High"),
                           prop      = c(0.4, 0.3, 0.3))

library(sampling)

# make results reproducible, the code below
# uses randomness twice, one in the call to 
# stats::r2dtable and the other in the call
# to sampling::strata
set.seed(2023)

# total to sample by gender and education
marg_gender <- pop_gender$prop * n
marg_education <- pop_education$prop * n

# account for the possibility that the framework does not
# have sufficient people in some strata to be sampled 
# to the intended stratum sample size.
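tbl <- table(framework[-1])
# index the frame counts by name so that they line up with the
# order of the benchmarks (table() sorts levels alphabetically)
marg_gender <- pmin(marg_gender, rowSums(tbl)[pop_gender$gender])
marg_education <- pmin(marg_education, colSums(tbl)[pop_education$education])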

# Random 2-way table with given marginals
sample_sizes <- r2dtable(1L, marg_gender, marg_education) |> unlist()

# stratified sampling without replacement
s <- strata(framework, c("gender", "education"), size = sample_sizes, method = "srswor")
# extract the sampled rows to a new data.frame
sample2 <- getdata(framework, s)

# see the results, the final proportions are
# not exactly the wanted proportions

# first gender
cbind(
  wanted = pop_gender$prop, 
  prop = prop.table(table(sample2$gender)) |> round(2)
)
#>   wanted prop
#> F    0.5 0.55
#> M    0.5 0.45

# and education
cbind(
  wanted = pop_education$prop, 
  prop = prop.table(table(sample2$education)) |> round(2)
)
#>         wanted prop
#> 1. Low     0.4 0.30
#> 2. Mid     0.3 0.36
#> 3. High    0.3 0.35

head(sample2)
#>      id gender education ID_unit     Prob Stratum
#> 10   10      F    2. Mid      10 0.328125       1
#> 101 101      F    2. Mid     101 0.328125       1
#> 116 116      F    2. Mid     116 0.328125       1
#> 120 120      F    2. Mid     120 0.328125       1
#> 138 138      F    2. Mid     138 0.328125       1
#> 146 146      F    2. Mid     146 0.328125       1

Created on 2023-11-18 with reprex v2.0.2
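Part of the gap between the wanted and the obtained proportions above may come from how the cell sizes are paired with the strata: unlist() on the r2dtable() result returns the counts in column-major order of the gender by education table, whereas strata() is documented to read size in the order in which the strata appear in the data. The sketch below avoids that ambiguity by looking each cell up by name and drawing from the matching stratum in base R. It is an illustration under stated assumptions, not the answer's method: frame, cells, picked and sample_aligned are hypothetical names, the frame is rebuilt so that the block runs on its own, and a quota is capped wherever a cell holds too few people, so the margins can still fall short.

```r
set.seed(2023)  # arbitrary seed for this sketch

N <- 1000  # frame size
n <- 300   # sample size

# hypothetical frame, built the same way as in the question
frame <- data.frame(
  id        = seq_len(N),
  gender    = sample(c("M", "F"), N, replace = TRUE, prob = c(0.3, 0.7)),
  education = sample(c("1. Low", "2. Mid", "3. High"), N, replace = TRUE,
                     prob = c(0.2, 0.3, 0.5))
)

# target counts per margin, rounded so r2dtable() receives whole numbers
target_gender    <- round(c(M = 0.5, F = 0.5) * n)
target_education <- round(c("1. Low" = 0.4, "2. Mid" = 0.3, "3. High" = 0.3) * n)

# random cell counts whose row and column sums equal the targets
cells <- r2dtable(1L, target_gender, target_education)[[1]]
dimnames(cells) <- list(names(target_gender), names(target_education))

# draw each cell's quota from the stratum with the same labels,
# capped at the number of people available in that stratum
picked <- unlist(lapply(names(target_gender), function(g) {
  lapply(names(target_education), function(e) {
    ids <- frame$id[frame$gender == g & frame$education == e]
    k   <- min(cells[g, e], length(ids))
    ids[sample.int(length(ids), k)]
  })
}))
sample_aligned <- frame[frame$id %in% picked, ]

# compare the obtained counts with the targets
rbind(wanted = target_gender,
      got    = table(sample_aligned$gender)[names(target_gender)])
rbind(wanted = target_education,
      got    = table(sample_aligned$education)[names(target_education)])
```

When no cell is capped, both margins equal their targets exactly, because the cell counts already sum to them. When a cell is capped, the shortfall shows up directly in the got rows.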

Data

The data set framework is created after a call to set.seed, so that the results are reproducible.

N <- 1000 # framework size
n <- 300 # sample size

prob_gender <- c(0.3, 0.7)
prob_education <- c(0.2, 0.3, 0.5)

# create sampling framework
set.seed(1)
framework = data.frame(id        = seq_len(N),
                       gender    = sample(c("M","F"), N, replace = TRUE, prob = prob_gender),
                       education = sample(c("1. Low", "2. Mid", "3. High"), N, 
                                          replace = TRUE, prob = prob_education))

Created on 2023-11-18 with reprex v2.0.2
