I ran this code to get the ethnicity breakdown of my sample:
dataset %>%
  group_by(ethnicity) %>%
  summarise(percent = 100 * n() / nrow(dataset))
However, because respondents could select more than one ethnicity category in the questionnaire, the result looks like this:
1 "[\"Aboriginal or Torres Strait Islander\",\"Caucasian\",\"Asian (inc. Indian subcontinent)\"]" 0.364
2 "[\"Aboriginal or Torres Strait Islander\",\"Caucasian\"]" 0.0910
3 "[\"Aboriginal or Torres Strait Islander\"]" 0.910
4 "[\"African\"]" 0.637
5 "[\"Asian (inc. Indian subcontinent)\"]" 0.0910
9 "[\"Caucasian\",\"Latino/Hispanic\"]" 0.182
10 "[\"Caucasian\",\"Middle Eastern\"]" 0.273
11 "[\"Caucasian\",\"Not listed\"]" 0.182
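etc.
What is the best/most efficient way to get a breakdown by individual (non-combined) category?
Essentially, I just want a percentage breakdown like this:
Caucasian -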
African -
Latino/Hispanic -
Aboriginal or Torres Strait Islander -
Middle Eastern -
etc.
library(tidyverse)
# Sample data reproducing the rows shown in the question
mydf <- data.frame(id = c(1:5, 9:11),
                   eth = c("[\"Aboriginal or Torres Strait Islander\",\"Caucasian\",\"Asian (inc. Indian subcontinent)\"]",
                           "[\"Aboriginal or Torres Strait Islander\",\"Caucasian\"]" ,
                           "[\"Aboriginal or Torres Strait Islander\"]" ,
                           "[\"African\"]" ,
                           "[\"Asian (inc. Indian subcontinent)\"]" ,
                           "[\"Caucasian\",\"Latino/Hispanic\"]" ,
                           "[\"Caucasian\",\"Middle Eastern\"]" ,
                           "[\"Caucasian\",\"Not listed\"]" )
)
mydf |>
  # one row per selected category (requires tidyr >= 1.3.0)
  separate_longer_delim(eth, ",") |>
  # drop the surrounding square brackets; the inner double quotes are kept
  mutate(eth = str_remove_all(eth, "\\[|\\]")) |>
  count(eth) |>
  # proportion of respondents (not of selections) who chose each category
  mutate(pct = n / nrow(mydf)) |>
  arrange(desc(pct))
eth n pct
1 "Caucasian" 5 0.625
2 "Aboriginal or Torres Strait Islander" 3 0.375
3 "Asian (inc. Indian subcontinent)" 2 0.250
4 "African" 1 0.125
5 "Latino/Hispanic" 1 0.125
6 "Middle Eastern" 1 0.125
7 "Not listed" 1 0.125
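Since the stored values look like JSON arrays, another option is to parse them rather than split on commas, which also removes the leftover double quotes and would survive a category name that contains a comma. The following is only a sketch under assumptions: it assumes every value in the column is valid JSON, that the jsonlite package is installed, and the `responses` data frame and its three rows are made up for illustration.

```r
library(tidyverse)

# Hypothetical sample in the same format as the question's column
responses <- tibble(
  id  = 1:3,
  eth = c("[\"Caucasian\",\"African\"]",
          "[\"Caucasian\"]",
          "[\"African\",\"Not listed\"]")
)

breakdown <- responses |>
  # parse each JSON array into a character vector (list column)
  mutate(eth = map(eth, jsonlite::fromJSON)) |>
  # one row per selected category
  unnest_longer(eth) |>
  count(eth) |>
  # percentage of respondents who selected each category
  mutate(percent = 100 * n / nrow(responses)) |>
  arrange(desc(percent))

print(breakdown)
```

Because the denominator is the number of respondents rather than the number of selections, the percentages can add up to more than 100 when people pick several categories.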