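Suppose I surveyed 1000 people, asking for their age, location, and 10 questions in a "select all that apply" format (e.g., "What are the reasons you like product #1 of our brand? Select all that apply"). Each question has 10 options (1. It looks stylish, 2. It is useful, 3. It is reasonably priced, and so on). Here is the code to create a mock dataset.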
library(tidyverse)
library(writexl)
set.seed(2)
record_id<- c(1:1000)
age<- as.factor(sample (c("teens", "young adults"), replace = T, size = 1000))
region<- as.factor(sample (c("rural", "urban"), replace = T, size = 1000))
N<- 1000
xy <- matrix(NA, nrow = N, ncol= 100)
for (i in 1:N) {
  xy[i, ] <- as.factor(sample(c("checked","unchecked"), replace = T, size = 100))
}
xy<- data.frame(xy)
num_questions <- 10
items_per_question <- 10
# Initialize an empty vector to store column names
col_names <- character(0)
# Loop through each group and item, generating column names
for (question in 1:num_questions) {
  for (item in 1:items_per_question) {
    col_name <- paste0("q", question, "_", item)
    col_names <- c(col_names, col_name)
  }
}
colnames(xy) <- col_names
mydf<- data.frame(record_id, age, region, xy)
# generating some NA to make the dataset realistic
set.seed(2)
mydf<- mydf %>% mutate(
  across(.cols= (c(4:100)),
         .fns = ~if_else(rbinom(n(), 1, 0.04) == 1L, NA, .x))
)
mydf[1:5, 1:5]
#   record_id          age region      q1_1      q1_2
# 1         1        teens  urban   checked   checked
# 2         2        teens  rural unchecked unchecked
# 3         3 young adults  urban   checked unchecked
# 4         4 young adults  urban   checked unchecked
# 5         5 young adults  rural unchecked unchecked
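My goal is to create frequency tables containing the percentage for each item of each question, where the denominator is the number of participants who answered the question (i.e., who checked at least one of the items listed under that question).
Also, because a colleague wants to make charts in Excel, the tables ultimately need to be exported to an Excel file.
Here is my attempt at achieving that.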
# Long format for question 1 only
mydf1_long <- mydf %>%
  select(record_id, age, region, starts_with("q1_")) %>%
  pivot_longer(-c(1:3), names_to = "item")
# Overall percentages
mydf1_long %>%
  filter(value == "checked") %>%
  distinct(record_id) %>%
  count(name = "den") %>%
  cbind(mydf1_long %>%
          filter(value == "checked") %>%
          count(item, name = "num")) %>%
  arrange(num) %>%
  mutate(perc = round((num / den), 2)) %>%
  select(-num, -den) %>%
  write_xlsx("q1.xlsx")
# Percentages by age
mydf1_long %>%
  filter(value == "checked") %>%
  distinct(record_id, age) %>%
  count(age, name = "den") %>%
  right_join(mydf1_long %>%
               filter(value == "checked") %>%
               count(age, item, name = "num")) %>%
  group_by(age) %>%
  arrange(num, .by_group = TRUE) %>%
  mutate(perc = round((num / den), 2)) %>%
  select(-num, -den) %>%
  write_xlsx("q1_age.xlsx")
# Percentages by region
mydf1_long %>%
  filter(value == "checked") %>%
  distinct(record_id, region) %>%
  count(region, name = "den") %>%
  right_join(mydf1_long %>%
               filter(value == "checked") %>%
               count(region, item, name = "num")) %>%
  group_by(region) %>%
  arrange(num, .by_group = TRUE) %>%
  mutate(perc = round((num / den), 2)) %>%
  select(-num, -den) %>%
  write_xlsx("q1_region.xlsx")
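As it stands, I would have to copy and paste this code for q2 through q10. I wonder whether I can loop or map over the questions instead. Can someone help me?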
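If I understand correctly, this may be what you are looking for. First use pivot_longer() and separate() to split the question from the response option, then use any() inside filter() to drop respondents who did not answer a question, and finally use summarize() to get the percentages and pivot_wider() to build the table: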
library(dplyr)
library(tidyr)
mydf %>%
  pivot_longer(starts_with("q"), names_to = "question") %>%
  separate(question, into = c("question", "response_option"), sep = "_") %>%
  filter(any(value == 1), .by = c(record_id, question)) %>%
  summarize(perc = sum(value == 1, na.rm = TRUE) / n(),
            .by = c(question, response_option)) %>%
  pivot_wider(names_from = response_option, values_from = perc, names_glue = "option_{response_option}")
Output:
   question option_1 option_2 option_3 option_4 option_5 option_6 option_7 option_8 option_9 option_10
   <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
 1 q1          0.473    0.446    0.495    0.477    0.475    0.493    0.471    0.509    0.496     0.490
 2 q2          0.467    0.475    0.501    0.482    0.498    0.471    0.472    0.485    0.485     0.488
 3 q3          0.470    0.483    0.498    0.473    0.484    0.490    0.473    0.453    0.486     0.486
 4 q4          0.461    0.473    0.498    0.462    0.482    0.464    0.458    0.482    0.466     0.488
 5 q5          0.471    0.504    0.494    0.482    0.475    0.480    0.511    0.467    0.453     0.497
 6 q6          0.494    0.474    0.465    0.481    0.498    0.472    0.505    0.495    0.462     0.478
 7 q7          0.495    0.490    0.483    0.493    0.476    0.474    0.495    0.474    0.465     0.489
 8 q8          0.491    0.484    0.469    0.479    0.495    0.464    0.495    0.503    0.473     0.474
 9 q9          0.457    0.472    0.452    0.488    0.451    0.479    0.474    0.469    0.473     0.471
10 q10         0.478    0.469    0.484    0.489    0.481    0.495    0.454    0.501    0.487     0.506
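The question also asks for breakdowns by age and region and for an Excel export. The following is a rough sketch of one way the pipeline above could be wrapped for that, not a definitive implementation. The helper name perc_table, the toy data frame, and the output file name are made up for illustration, and the code assumes that 1 means "checked". It relies on write_xlsx() accepting a named list of data frames and writing one sheet per element.

```r
library(dplyr)
library(tidyr)
library(writexl)

# Small stand-in for mydf (hypothetical data): 1 = checked, 2 = unchecked
set.seed(1)
n_resp <- 200
toy <- data.frame(
  record_id = seq_len(n_resp),
  age       = sample(c("teens", "young adults"), n_resp, replace = TRUE),
  region    = sample(c("rural", "urban"), n_resp, replace = TRUE)
)
for (q in 1:2) {
  for (i in 1:3) {
    toy[[paste0("q", q, "_", i)]] <-
      sample(c(1L, 2L, NA), n_resp, replace = TRUE, prob = c(0.48, 0.48, 0.04))
  }
}

# Hypothetical helper: share of each option among respondents who checked
# at least one option of that question, optionally split by group_vars
perc_table <- function(data, group_vars = character(0)) {
  data %>%
    pivot_longer(matches("^q\\d+_\\d+$"),
                 names_to = c("question", "response_option"),
                 names_sep = "_") %>%
    # keep only respondents who checked something for this question
    filter(any(value == 1, na.rm = TRUE), .by = c(record_id, question)) %>%
    summarize(perc = round(sum(value == 1, na.rm = TRUE) / n(), 2),
              .by = all_of(c(group_vars, "question", "response_option"))) %>%
    pivot_wider(names_from = response_option, values_from = perc,
                names_glue = "option_{response_option}")
}

# One table per breakdown, written as separate sheets of a single workbook
tables <- list(
  overall   = perc_table(toy),
  by_age    = perc_table(toy, "age"),
  by_region = perc_table(toy, "region")
)
out_file <- file.path(tempdir(), "survey_percentages.xlsx")
write_xlsx(tables, out_file)
```

With the real data, perc_table(mydf, "age") should give one row per age group and question, so no copy-pasting per question is needed.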
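Note that your example data does not actually contain "checked" and "unchecked" but 1 and 2, because assigning a factor into a row of a matrix stores its integer level codes. Since "checked" sorts first alphabetically, I assumed 1 == checked.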
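To illustrate the note above, here is a small sketch of where the 1 and 2 come from, along with one possible way to keep the labels. The object names m, f, and xy_fixed are only for this example, and the suggested fix draws different random values than the original loop, so the percentages would not match the output shown earlier.

```r
# Assigning a factor into a matrix row keeps only the integer level codes
m <- matrix(NA, nrow = 2, ncol = 3)
f <- factor(c("checked", "unchecked", "checked"))
m[1, ] <- f
m[1, ]  # codes of the levels, not the labels

# One possible fix: sample the labels straight into a character matrix
set.seed(2)
N <- 5
xy_fixed <- matrix(sample(c("checked", "unchecked"), N * 4, replace = TRUE),
                   nrow = N, ncol = 4)
xy_fixed <- data.frame(xy_fixed)
```

If the data are built this way, the comparison in the pipeline would be value == "checked" rather than value == 1.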