Simulating a data sample

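I have the following probabilities for each group, and each group represents a range of values. My goal is to simulate 1,234 rows of data that correspond to the groups and percentages: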

ages = c(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01) / 100

age_bins = sapply(list(21:29, 30:39,40:49, 50:59, 60:69, 70:79, 80:89, 90:99), length)
age_weighted = rep(age_probs/age_bins, age_bins)

set.seed(1)
n = 1234
data = data.frame(ID = sample(n),
                  Age = sample(ages, size = n, prob = age_weighted, replace = TRUE))

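However, the percentages in the simulated data don't match the targets, and sometimes the difference is too large (I think because the data set isn't big enough). I found another post, which mentions that this happens because our "view" of the randomness is effectively "one cell at a time" instead of "one column at a time"; that remark was about the sample() function.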
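A rough sketch of how much deviation sampling noise alone would produce, assuming each row is an independent draw so that each bin count is binomial. The names p and se_pct are introduced here for illustration only; the probabilities are the ones listed above, rescaled to sum to 1.

```r
# Target bin probabilities from the question, rescaled to sum to 1
age_probs <- c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)
p <- age_probs / sum(age_probs)
n <- 1234

# Binomial standard error of each simulated bin share, in percentage points
se_pct <- 100 * sqrt(p * (1 - p) / n)

data.frame(bin = c("21-29", "30-39", "40-49", "50-59",
                   "60-69", "70-79", "80-89", "90+"),
           target_pct = round(100 * p, 2),
           se_pct = round(se_pct, 2))
```

Under that assumption, the 50-59 bin has a standard error of roughly 1.2 percentage points at n = 1234, so gaps of one or two points between simulated and target shares are within what random draws typically produce.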

How can I change the sample function so that it better represents the population percentages?

Oh, and this is how I check the data frame column:

library(dplyr)

to_export = data[order(data$ID),]

# label each row with its age bin
to_export$block = NA_character_
for (i in seq_len(nrow(to_export))) {
  if (to_export$Age[i] >= 21 & to_export$Age[i] <= 29) to_export$block[i] = "21-29"
  if (to_export$Age[i] >= 30 & to_export$Age[i] <= 39) to_export$block[i] = "30-39"
  if (to_export$Age[i] >= 40 & to_export$Age[i] <= 49) to_export$block[i] = "40-49"
  if (to_export$Age[i] >= 50 & to_export$Age[i] <= 59) to_export$block[i] = "50-59"
  if (to_export$Age[i] >= 60 & to_export$Age[i] <= 69) to_export$block[i] = "60-69"
  if (to_export$Age[i] >= 70 & to_export$Age[i] <= 79) to_export$block[i] = "70-79"
  if (to_export$Age[i] >= 80 & to_export$Age[i] <= 89) to_export$block[i] = "80-89"
  if (to_export$Age[i] >= 90) to_export$block[i] = "90+"
}

#to_export

# percentage of rows in each bin
age_table = to_export %>% group_by(block) %>% summarise(percentage = round(n()/nrow(to_export) * 100,2))

age_table
r statistics probability sampling
1 Answer

I'd suggest a redesign. I'm using dplyr and ggplot2 here, but they aren't strictly needed:

library(dplyr)
library(ggplot2)

set.seed(1)
n = 1234

# Definition of the age buckets
ages = c("21:29", "30:39","40:49", "50:59", "60:69", "70:79", "80:89", "90:99")

# probability for each bucket
age_probs = c(10.85,12.64,14.02,25.00,19.01,11.45,7.01,0.01)

# normalise the probabilities since they don't add up to 1
c_age_probs = cumsum(age_probs)/sum(age_probs)

# create the data.frame
data = data.frame(ID = 1:n,
                  Age = ages[findInterval(runif(n), c_age_probs) + 1])

# plotting the data
ggplot(data, aes(x=Age)) + 
  geom_bar()

Judged against the given probabilities, the plot looks fine. Let's look at the percentages:

# getting the percentage
data %>%
  group_by(Age) %>%
  summarise(percentage = n()/n)

#   A tibble: 7 x 2
#   Age   percentage
#   <chr>      <dbl>
# 1 21:29     0.0989
# 2 30:39     0.105 
# 3 40:49     0.133 
# 4 50:59     0.269 
# 5 60:69     0.198 
# 6 70:79     0.126 
# 7 80:89     0.0705
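The shares above are close to the targets but not identical, which is expected from independent random draws. If the bin shares have to match the targets as closely as n allows, one option is to fix the count per bin up front and only randomise the ages within each bin. The following is a minimal sketch of that idea, a different technique from the findInterval() approach in this answer; the names bins, counts, and data_fixed are hypothetical, and the largest-remainder rounding is an assumption about how leftover rows should be distributed.

```r
set.seed(1)
n <- 1234

# Bins and target shares from the question (shares rescaled to sum to 1)
bins <- list(21:29, 30:39, 40:49, 50:59, 60:69, 70:79, 80:89, 90:99)
age_probs <- c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)
p <- age_probs / sum(age_probs)

# Largest-remainder rounding: floor the ideal counts, then hand the
# leftover rows to the bins with the largest fractional parts
ideal <- n * p
counts <- floor(ideal)
leftover <- n - sum(counts)
top <- order(ideal - counts, decreasing = TRUE)[seq_len(leftover)]
counts[top] <- counts[top] + 1

# Draw ages uniformly within each bin, then shuffle the rows
draw_bin <- function(b, k) b[sample.int(length(b), k, replace = TRUE)]
ages_fixed <- unlist(Map(draw_bin, bins, counts))
data_fixed <- data.frame(ID = seq_len(n), Age = sample(ages_fixed))
```

With this approach each bin count is within one row of n times its target share. The trade-off is that the bin counts are no longer random, so the result is a quota-style sample and not a set of independent draws.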

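The key part is ages[findInterval(runif(n), c_age_probs) + 1]. I generate uniformly distributed random numbers and use the cumulative (and normalised) probabilities to map each number to its age bucket. This way I don't even need to write multiple case_when statements.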
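To make the lookup concrete, here is a small sketch that swaps runif(n) for a few hand-picked values, so the mapping can be followed by eye. The vector u and the name idx are introduced only for this illustration.

```r
ages <- c("21:29", "30:39", "40:49", "50:59", "60:69", "70:79", "80:89", "90:99")
age_probs <- c(10.85, 12.64, 14.02, 25.00, 19.01, 11.45, 7.01, 0.01)
c_age_probs <- cumsum(age_probs) / sum(age_probs)

# Hand-picked values standing in for runif(n)
u <- c(0.05, 0.30, 0.50, 0.95)

# findInterval() counts how many cumulative breakpoints are <= u,
# so adding 1 gives the index of the bucket that u falls into
idx <- findInterval(u, c_age_probs) + 1
ages[idx]
# "21:29" "40:49" "50:59" "80:89"
```

For example, 0.30 lies above the first two cumulative breakpoints (about 0.109 and 0.235) and below the third (about 0.375), so it lands in the third bucket, 40:49. Wider probability intervals catch more uniform draws, which is what makes the bucket frequencies follow the target probabilities.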
