How to use a for loop or the purrr package to simplify a repetitive process in R


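Suppose I surveyed 1,000 people, asking for their age and location plus 10 "select all that apply" questions (e.g., "Why do you like our brand's product #1? Select all that apply"). Each question has 10 options (1. It looks stylish, 2. It is useful, 3. It is reasonably priced, etc.). Here is the code to create a mock dataset.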

library(tidyverse)
library(writexl)

set.seed(2)

record_id<- c(1:1000)
age<- as.factor(sample (c("teens", "young adults"), replace = T, size = 1000))
region<- as.factor(sample (c("rural", "urban"), replace = T, size = 1000))

N<- 1000
xy <- matrix(NA, nrow = N, ncol= 100)
for (i in 1:N) {
  xy[i, ] <-  as.factor(sample(c("checked","unchecked"), replace = T, size = 100))
}

xy<- data.frame(xy)

num_questions <- 10
items_per_question <- 10

# Initialize an empty vector to store column names
col_names <- character(0)

# Loop through each group and item, generating column names
for (question in 1:num_questions) {
  for (item in 1:items_per_question) {
    col_name <- paste0("q", question, "_", item)
    col_names <- c(col_names, col_name)
  }
}

colnames(xy) <- col_names
mydf<- data.frame(record_id, age, region, xy)  

# generating some NA to make the dataset realistic 
set.seed(2)
mydf<- mydf %>% mutate(
  across(.cols= (c(4:100)),
         .fns = ~if_else(rbinom(n(), 1, 0.04) == 1L, NA, .x))
)  

mydf[1:5, 1:5]

#  record_id          age region      q1_1      q1_2
#1         1        teens  urban   checked   checked
#2         2        teens  rural unchecked unchecked
#3         3 young adults  urban   checked unchecked
#4         4 young adults  urban   checked unchecked
#5         5 young adults  rural unchecked unchecked

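My goal is to create frequency tables with the percentage for each item of each question, where the denominator is the number of participants who answered the question (i.e., who checked at least one of the items listed under it).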
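To make the denominator concrete, here is a minimal sketch on a made-up four-person dataset (the `toy` data frame and its values are hypothetical, not part of the mock dataset above). Respondents 3 and 4 checked nothing under q1, so they should not count toward the denominator.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: 4 respondents, one question (q1) with 2 items
toy <- data.frame(
  record_id = 1:4,
  q1_1 = c("checked", "unchecked", "unchecked", NA),
  q1_2 = c("checked", "checked",   "unchecked", NA)
)

toy_long <- toy %>%
  pivot_longer(starts_with("q1_"), names_to = "item")

# Denominator: respondents who checked at least one q1 item
den <- toy_long %>%
  filter(value == "checked") %>%
  distinct(record_id) %>%
  nrow()

# Numerator per item, divided by the shared denominator
toy_perc <- toy_long %>%
  filter(value == "checked") %>%
  count(item, name = "num") %>%
  mutate(perc = num / den)

toy_perc
```

Here only respondents 1 and 2 answered q1, so each item's count is divided by 2 rather than by 4.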

Also, because a colleague wants to build charts in Excel, the tables ultimately need to be exported to Excel files.

Here is my attempt at this.

mydf1_long<- mydf %>%  select(record_id,age, region,  starts_with("q1_")) %>%  
  pivot_longer(-c(1:3), names_to = "item")


mydf1_long %>% filter(value == "checked") %>%  distinct(record_id) %>% 
  count(name = "den") %>% 
  cbind(mydf1_long %>% 
          filter(value == "checked") %>% 
          count(item, name = "num")) %>% arrange(num) %>% 
  mutate(perc = round((num / den) , 2)) %>% select(-num, -den) %>%  write_xlsx("q1.xlsx")


mydf1_long %>% filter(value == "checked") %>%  distinct(record_id, age) %>% 
  count(age, name = "den") %>% 
  right_join(mydf1_long %>% 
               filter(value == "checked") %>% 
               count(age, item, name = "num")) %>% group_by(age) %>% arrange(num, .by_group = TRUE) %>% 
  mutate(perc = round((num / den) , 2)) %>% select(-num, -den) %>%  write_xlsx("q1_age.xlsx")


mydf1_long %>% filter(value == "checked") %>%  distinct(record_id, region) %>% 
  count(region, name = "den") %>% 
  right_join(mydf1_long %>% 
               filter(value == "checked") %>% 
               count(region, item, name = "num")) %>% group_by(region) %>% arrange(num, .by_group = TRUE) %>% 
  mutate(perc = round((num / den) , 2)) %>% select(-num, -den) %>%  write_xlsx("q1_region.xlsx")

I would have to copy and paste this code for q2 through q10. I am wondering whether I can loop or map over it instead. Can anyone help?

r function for-loop purrr data-wrangling
1 Answer

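If I understand correctly, this may be what you are looking for. First use pivot_longer() and separate() to split the question from the response option, then use any() inside filter() to drop respondents who did not answer a question, and finally use summarize() to get the percentages and pivot_wider() to build the table: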

library(dplyr)
library(tidyr)

mydf %>% 
  pivot_longer(starts_with("q"), names_to = "question") %>%
  separate(question, into = c("question", "response_option"), sep = "_") %>%
  filter(any(value == 1), .by = c(record_id, question)) %>%
  summarize(perc = sum(value == 1, na.rm = TRUE) / n(), 
            .by = c(question, response_option)) %>%
  pivot_wider(names_from = response_option, values_from = perc, names_glue = "option_{response_option}")

Output:

   question option_1 option_2 option_3 option_4 option_5 option_6 option_7 option_8 option_9 option_10
   <chr>       <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>     <dbl>
 1 q1          0.473    0.446    0.495    0.477    0.475    0.493    0.471    0.509    0.496     0.490
 2 q2          0.467    0.475    0.501    0.482    0.498    0.471    0.472    0.485    0.485     0.488
 3 q3          0.470    0.483    0.498    0.473    0.484    0.490    0.473    0.453    0.486     0.486
 4 q4          0.461    0.473    0.498    0.462    0.482    0.464    0.458    0.482    0.466     0.488
 5 q5          0.471    0.504    0.494    0.482    0.475    0.480    0.511    0.467    0.453     0.497
 6 q6          0.494    0.474    0.465    0.481    0.498    0.472    0.505    0.495    0.462     0.478
 7 q7          0.495    0.490    0.483    0.493    0.476    0.474    0.495    0.474    0.465     0.489
 8 q8          0.491    0.484    0.469    0.479    0.495    0.464    0.495    0.503    0.473     0.474
 9 q9          0.457    0.472    0.452    0.488    0.451    0.479    0.474    0.469    0.473     0.471
10 q10         0.478    0.469    0.484    0.489    0.481    0.495    0.454    0.501    0.487     0.506
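The question also asked for breakdowns by age and region and for Excel export, which the pipeline above does not cover. One possible way to do this is sketched below: wrap the same steps in a function and map over the grouping variables. The helper `item_percentages()`, its `by` argument, and the `toy` data are hypothetical names introduced for illustration, and the sketch assumes responses are coded 1 = checked, 2 = unchecked.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical helper (not from the original answer): share of respondents
# checking each item, among those who checked at least one item of that question.
# Assumes 1 == checked.
item_percentages <- function(data, by = character(0)) {
  data %>%
    pivot_longer(matches("^q\\d+_\\d+$"), names_to = "question") %>%
    separate(question, into = c("question", "response_option"), sep = "_") %>%
    # keep only respondents who checked at least one item of the question
    group_by(record_id, question) %>%
    filter(any(value == 1, na.rm = TRUE)) %>%
    # regroup by the optional breakdown variables plus question and item
    group_by(across(all_of(c(by, "question", "response_option")))) %>%
    summarise(perc = round(sum(value == 1, na.rm = TRUE) / n(), 2),
              .groups = "drop")
}

# Toy data standing in for mydf (integer codes: 1 = checked, 2 = unchecked)
toy <- data.frame(
  record_id = 1:4,
  age    = c("teens", "teens", "young adults", "young adults"),
  region = c("rural", "urban", "rural", "urban"),
  q1_1 = c(1L, 2L, 1L, 2L),
  q1_2 = c(1L, 1L, 2L, 2L),
  q2_1 = c(2L, 1L, NA, 1L),
  q2_2 = c(2L, 2L, NA, 1L)
)

# One table per breakdown: overall, by age, by region
breakdowns <- list(overall = character(0), age = "age", region = "region")
tables <- map(breakdowns, function(b) item_percentages(toy, by = b))

tables$age

# writexl::write_xlsx() accepts a named list and writes one sheet per element:
# writexl::write_xlsx(tables, "item_percentages.xlsx")
```

Replacing `toy` with `mydf` should give all ten questions at once, in long format; a `pivot_wider()` step can be appended if a wide layout is preferred for charting.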

Note that your sample data does not actually contain "checked" and "unchecked" but 1 and 2 instead, so I assumed 1 == checked.
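The 1 and 2 most likely come from assigning a factor into a matrix row: the matrix keeps only the factor's underlying integer codes, and because levels are sorted alphabetically, "checked" becomes 1. A small sketch of that behaviour (the matrices `m` and `m2` are illustrative only):

```r
# A logical matrix filled with NA, as in the question's setup
m <- matrix(NA, nrow = 2, ncol = 3)

# Assigning a factor into a matrix row drops the labels and keeps the integer codes
m[1, ] <- as.factor(c("checked", "unchecked", "checked"))
m[1, ]

# Assigning the character vector directly keeps the labels
m2 <- matrix(NA_character_, nrow = 2, ncol = 3)
m2[1, ] <- c("checked", "unchecked", "checked")
m2[1, ]
```

If the labels are wanted in the mock data, dropping `as.factor()` inside the loop should be enough, in which case the comparisons in the pipeline would be against "checked" rather than 1.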
