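I have a data frame that holds some exam statistics, tracked across different years and groups. I want to build a function that adds new columns giving, for each group, the change in those statistics relative to each year in a dynamically supplied list of reference years.
Here is an example of the output I want.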
grades <- data.frame(
  Group = c(rep("A", 4), rep("B", 4)),
  Year = rep(seq(2015, 2018), 2),
  Mean = c(seq(100, 130, 10), seq(200, 260, 20)),
  PassR = c(seq(0.5, 0.53, 0.01), seq(0.6, 0.66, 0.02))
)
grades |> group_by(Group) |> calculateDifferences(c(2015, 2016))
# A tibble: 8 × 8
# Groups:   Group [2]
  Group  Year  Mean PassR Mean_Diff2015 Mean_Diff2016 PassR_Diff2015 PassR_Diff2016
  <chr> <int> <dbl> <dbl>         <dbl>         <dbl>          <dbl>          <dbl>
1 A      2015   100   0.5             0           -10              0        -0.0100
2 A      2016   110  0.51            10             0         0.0100              0
3 A      2017   120  0.52            20            10         0.0200         0.0100
4 A      2018   130  0.53            30            20         0.0300         0.0200
5 B      2015   200   0.6             0           -20              0        -0.0200
6 B      2016   220  0.62            20             0         0.0200              0
7 B      2017   240  0.64            40            20         0.0400         0.0200
8 B      2018   260  0.66            60            40         0.0600         0.0400
My best attempt is the function below, but it runs into a scoping problem with the Year column inside the list of functions.
# Calculate differences from the given year for both mean and pass rate
calculateDifferences <- function(data, diffYears) {
  mutate(data,
    across(
      any_of(c("Mean", "PassR")),
      #list(Diff2015 = function(col) col - col[Year == 2015],
      #     Diff2016 = function(col) col - col[Year == 2016]),
      map(as.list(diffYears), function(year) { function(col) col - col[Year == year] }) |>
        set_names(str_c("Diff", diffYears)),
      .names = "{.col}_{.fn}"
    )
  )
}
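Running this code complains that the object Year cannot be found. I tried introducing some NSE to delay evaluation of the variable, but neither !!substitute("Year") nor !!quo("Year") produces the desired output: each just fails with a dplyr::mutate_incompatible_size <named_list> error. Replacing it with .data[["Year"]] complains that it is not being used in a data-masking context.
If I hardcode the years (as in the commented-out part of the function), it runs correctly and produces the desired output, but it cannot adapt to a dynamically supplied list of years.
I could try pulling the Year column out separately with data[["Year"]]. That works well if the data is ungrouped, but not if it is grouped.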
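To make the grouped case concrete, here is a minimal base-R sketch of what seems to go wrong. It assumes that, under grouping, each function receives only the current group's slice as col, while data[["Year"]] is still the full eight-row column. The names year_all, col_A and baseline are made up for the illustration.

```r
# Same example data as above (PassR omitted for brevity).
grades <- data.frame(
  Group = c(rep("A", 4), rep("B", 4)),
  Year  = rep(seq(2015, 2018), 2),
  Mean  = c(seq(100, 130, 10), seq(200, 260, 20))
)

# The full Year column, as data[["Year"]] would return it (8 values).
year_all <- grades[["Year"]]

# Assumption: this is the slice of Mean that group A's call sees as `col` (4 values).
col_A <- grades$Mean[grades$Group == "A"]

# The logical index has length 8, so its TRUE at position 5 points past the
# end of the length-4 slice and yields NA.
baseline <- col_A[year_all == 2015]
print(baseline)          # 100 NA

# The length-2 baseline is then recycled across the 4 group values.
print(col_A - baseline)  # 0 NA 20 NA
```

Under that assumption the grouped result is not an error but a silently wrong column, which is why the lookup of Year has to happen per group.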
Use pick() to access the current group's data (it supersedes cur_data(), which is deprecated as of dplyr 1.1.0):
library(dplyr)
library(purrr)
library(stringr)
calculateDifferences <- function(data, diffYears) {
  mutate(data,
    across(
      any_of(c("Mean", "PassR")),
      # Build one function per reference year; pick(Year) fetches the
      # current group's Year column when the function is called.
      map(as.list(diffYears), function(year) { function(col) col - col[pick(Year)$Year == year] }) |>
        set_names(str_c("Diff", diffYears)),
      .names = "{.col}_{.fn}"
    )
  )
}
grades <- data.frame(
  Group = c(rep("A", 4), rep("B", 4)),
  Year = rep(seq(2015, 2018), 2),
  Mean = c(seq(100, 130, 10), seq(200, 260, 20)),
  PassR = c(seq(0.5, 0.53, 0.01), seq(0.6, 0.66, 0.02))
)
grades |> group_by(Group) |> calculateDifferences(c(2015, 2016))
# A tibble: 8 × 8
# Groups:   Group [2]
  Group  Year  Mean PassR Mean_Diff2015 Mean_Diff2016 PassR_Diff2015 PassR_Diff2016
  <chr> <int> <dbl> <dbl>         <dbl>         <dbl>          <dbl>          <dbl>
1 A      2015   100   0.5             0           -10              0        -0.0100
2 A      2016   110  0.51            10             0         0.0100              0
3 A      2017   120  0.52            20            10         0.0200         0.0100
4 A      2018   130  0.53            30            20         0.0300         0.0200
5 B      2015   200   0.6             0           -20              0        -0.0200
6 B      2016   220  0.62            20             0         0.0200              0
7 B      2017   240  0.64            40            20         0.0400         0.0200
8 B      2018   260  0.66            60            40         0.0600         0.0400
It is not clear to me why it can find cur_data()$Year (or pick(Year)$Year), but not .data[["Year"]] or just Year.
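A likely explanation, offered as an interpretation rather than something confirmed above: a bare Year inside a function is looked up lexically, in the environment where that function was created, and a function built by map() or defined elsewhere does not have the data in its lookup chain. pick(), like cur_data(), is instead resolved at call time against dplyr's current group context, so it works wherever the function was defined. The sketch below uses made-up helper names (diff_bare, diff_pick) and a tiny data frame to show the difference.

```r
library(dplyr)

df <- data.frame(
  Group = c("A", "A", "B", "B"),
  Year  = c(2015, 2016, 2015, 2016),
  Mean  = c(100, 110, 200, 220)
) |>
  group_by(Group)

# Hypothetical helper: refers to Year as a bare name, so R searches the
# environment where the function was defined, not the data frame.
diff_bare <- function(col) col - col[Year == 2015]

# Hypothetical helper: asks dplyr for the current group's Year at call time.
diff_pick <- function(col) col - col[pick(Year)$Year == 2015]

# The bare-name version fails because no object named Year is in scope.
res_bare <- try(
  mutate(df, across(Mean, diff_bare, .names = "{.col}_Diff2015")),
  silent = TRUE
)
print(inherits(res_bare, "try-error"))  # TRUE

# The pick() version computes the difference within each group.
res_pick <- mutate(df, across(Mean, diff_pick, .names = "{.col}_Diff2015"))
print(res_pick$Mean_Diff2015)           # 0 10 0 20
```

This would also fit the observation that the hardcoded functions written directly inside across() work, while the ones generated by map() do not.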
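Here is an alternative approach that relies on returning a tibble from inside across() and then unpacking it with the .unpack argument. I have changed the function so that the variables can be passed as an argument instead of being hardcoded (which also lets you use tidyselect features if needed), and likewise the grouping.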
library(purrr)
library(dplyr)
calculateDifferences <- function(data, vars, diffYears, group = Group) {
  data %>%
    mutate(
      across({{ vars }}, ~
        map(diffYears, \(year)
          tibble("Diff{year}" := .x - .x[Year == year])
        ) |>
          list_cbind(),
        .unpack = TRUE),
      .by = {{ group }}
    )
}
grades |>
calculateDifferences(c(Mean, PassR), c(2015, 2016))
  Group Year Mean PassR Mean_Diff2015 Mean_Diff2016 PassR_Diff2015 PassR_Diff2016
1     A 2015  100  0.50             0           -10           0.00          -0.01
2     A 2016  110  0.51            10             0           0.01           0.00
3     A 2017  120  0.52            20            10           0.02           0.01
4     A 2018  130  0.53            30            20           0.03           0.02
5     B 2015  200  0.60             0           -20           0.00          -0.02
6     B 2016  220  0.62            20             0           0.02           0.00
7     B 2017  240  0.64            40            20           0.04           0.02
8     B 2018  260  0.66            60            40           0.06           0.04
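For comparison, here is one more variation, sketched on the assumption that the order of the new columns does not matter: loop over the reference years and call mutate() once per year with a lambda written directly inside across(). Written inline like this, the lambda can see Year per group, much like the hardcoded version reported as working in the question. The name calculateDifferencesLoop is made up for this sketch, and it expects data that is already grouped.

```r
library(dplyr)

# Hypothetical helper: one mutate() per reference year.
calculateDifferencesLoop <- function(data, diffYears) {
  for (yr in diffYears) {
    data <- mutate(
      data,
      across(
        any_of(c("Mean", "PassR")),
        # Inline lambda: Year is the current group's column, yr the loop value.
        ~ .x - .x[Year == yr],
        # Builds a glue template such as "{.col}_Diff2015".
        .names = paste0("{.col}_Diff", yr)
      )
    )
  }
  data
}

grades <- data.frame(
  Group = c(rep("A", 4), rep("B", 4)),
  Year  = rep(seq(2015, 2018), 2),
  Mean  = c(seq(100, 130, 10), seq(200, 260, 20)),
  PassR = c(seq(0.5, 0.53, 0.01), seq(0.6, 0.66, 0.02))
)

res <- grades |>
  group_by(Group) |>
  calculateDifferencesLoop(c(2015, 2016))
print(names(res))
```

The new columns come out grouped by year (Mean_Diff2015, PassR_Diff2015, Mean_Diff2016, PassR_Diff2016) rather than by statistic, so a relocate() or select() step would be needed to match the column order shown above. A reference year that is missing from a group yields an empty baseline, which this sketch does not handle.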