我有一个大型数据集,我希望通过这个数据集获得一列的汇总估计值(平均值,中位数,计数等)。
尝试使用purrr
很难找到如何做到这一点 - 希望能够让这个工作流程点击以备将来的项目......但是非常困难。
作为一个可重复的示例,这适用于am
和vs
的分组,并估计mpg
的汇总值
library(tidyverse)
library(rlang)
mtcars %>%
group_by(am, vs) %>%
summarise(mean_mpg = mean(mpg),
median_mpg = median(mpg),
count = n())
但是,为了扩展这个例子,我想要为am
和vs
分组;然后am
和gear
;然后am
和carb
。直觉上,这似乎是map
应该处理的事情。
group_vars <- c("vs", "gear", "carb")
group_syms <- rlang::syms(group_vars)
sym_am <- rlang::sym("am")
mtcars %>%
map_df(~group_by(!!sym_am, !!!group_syms) %>%
summarise(mean_mpg = mean(mpg),
summarise(median_mpg = median(mpg),
summarise(count = n())
)
#Error in !sym_am : invalid argument type
我们可以使用来自map2
的purrr
来使用多个符号作为参数,然后在group_by
和summarise
中对其进行评估输出
library(tidyverse)
map2_df(list(sym_am), group_syms, ~ mtcars %>%
group_by(!!.x, !!.y) %>%
summarise(mean_mgp = mean(mpg), median_mpg = median(mpg),count = n()))
这是一种方法
library(tidyverse)
variable_grp <- c("vs", "gear", "carb")
constant_grp <- c("am")
group_vars <- lapply(variable_grp, function(i) c(constant_grp, i))
map(group_vars, ~group_by_at(mtcars, .x) %>%
summarise( mean_mgp = mean(mpg),
median_mpg = median(mpg),
count = n()))
这将生成每个组的摘要统计信息列表。使用map_df
解决问题的一个问题是每个组的列名都不同(第一组:am,vs;第二组:am,gear ...)。因此,如果您使用的是variable_column
,则需要重命名map_df
map_df(group_vars, ~group_by_at(mtcars, .x) %>%
summarise( mean_mgp = mean(mpg),
median_mpg = median(mpg),
count = n()) %>%
setNames(c("am", "variable_column", "mean_mpg", "median_mpg", "count")))
# A tibble: 17 x 5
# Groups: am [2]
# am variable_column mean_mpg median_mpg count
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 0 0 15.05000 15.20 12
# 2 0 1 20.74286 21.40 7
# 3 1 0 19.75000 20.35 6
# 4 1 1 28.37143 30.40 7
# 5 0 3 16.10667 15.50 15
# 6 0 4 21.05000 21.00 4
# 7 1 4 26.27500 25.05 8
# 8 1 5 21.38000 19.70 5
# 9 0 1 20.33333 21.40 3
# 10 0 2 19.30000 18.95 6
# 11 0 3 16.30000 16.40 3
# 12 0 4 14.30000 14.30 7
# 13 1 1 29.10000 29.85 4
# 14 1 2 27.05000 28.20 4
# 15 1 4 19.26667 21.00 3
# 16 1 6 19.70000 19.70 1
# 17 1 8 15.00000 15.00 1
你可以使用variable_column
的.id
参数和post-map_df map_df
保存mutate
名称
map_df(group_vars, ~group_by_at(mtcars, .x) %>%
summarise( mean_mgp = mean(mpg),
median_mpg = median(mpg),
count = n()) %>%
setNames(c("am", "variable_column", "mean_mpg", "median_mpg", "count")),
.id="variable_col_name") %>%
mutate(variable_col_name = variable_grp[as.numeric(variable_col_name)])
# A tibble: 17 x 6
# Groups: am [2]
# variable_col_name am variable_column mean_mpg median_mpg count
# <chr> <dbl> <dbl> <dbl> <dbl> <int>
# 1 vs 0 0 15.05000 15.20 12
# 2 vs 0 1 20.74286 21.40 7
# 3 vs 1 0 19.75000 20.35 6
# 4 vs 1 1 28.37143 30.40 7
# 5 gear 0 3 16.10667 15.50 15
# 6 gear 0 4 21.05000 21.00 4
# 7 gear 1 4 26.27500 25.05 8
# 8 gear 1 5 21.38000 19.70 5
# 9 carb 0 1 20.33333 21.40 3
# 10 carb 0 2 19.30000 18.95 6
# 11 carb 0 3 16.30000 16.40 3
# 12 carb 0 4 14.30000 14.30 7
# 13 carb 1 1 29.10000 29.85 4
# 14 carb 1 2 27.05000 28.20 4
# 15 carb 1 4 19.26667 21.00 3
# 16 carb 1 6 19.70000 19.70 1
# 17 carb 1 8 15.00000 15.00 1