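I'm fairly new to dplyr and I'd like to do the following calculation.
I have a df like this, organised by group (cohort); within each group the values are tied to an order_number reference: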
library(tidyverse)
df <- tibble::tribble(
~cohort, ~order_number, ~post, ~pre,
"2019-06", 0, 138.86, 163.36,
"2019-06", 3, 148.54, 174.75,
"2019-06", 6, 192.52, 226.5,
"2019-06", 9, 233.32, 283.5,
"2019-07", 0, 127.81, 150.37,
"2019-07", 3, 140.16, 164.83,
"2019-07", 6, 121.51, 142.93,
"2019-07", 9, 138.71, 162.86
)
# A tibble: 8 x 4
cohort order_number post pre
<chr> <dbl> <dbl> <dbl>
1 2019-06 0 139. 163.
2 2019-06 3 149. 175.
3 2019-06 6 193. 226.
4 2019-06 9 233. 284.
5 2019-07 0 128. 150.
6 2019-07 3 140. 165.
7 2019-07 6 122. 143.
8 2019-07 9 139. 163.
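I would like to perform these calculations.
That is, at the first step (order 0) I compute 139/139 = 1, at the second step (order 3) I compute 139/149 = 0.93, and so on, for each cohort and for both numeric columns.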
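To make the arithmetic concrete, here is a minimal sketch for a single column of a single cohort, using the `post` values of cohort 2019-06 from the data above. The variable names `post_2019_06` and `ratio` are only illustrative, and the sketch assumes the values are already sorted by order_number so that the first element is the order 0 value.

```r
# post values for cohort "2019-06", in order_number order (0, 3, 6, 9)
post_2019_06 <- c(138.86, 148.54, 192.52, 233.32)

# divide the first (order 0) value by every value in the group
ratio <- post_2019_06[1] / post_2019_06

round(ratio, 2)
# [1] 1.00 0.93 0.72 0.60
```

The same division is then repeated for `pre` and for every other cohort.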
The result should look like this:
df_calc <- data.frame(stringsAsFactors=FALSE,
cohort = c("2019-06", "2019-06", "2019-06", "2019-06",
"2019-07", "2019-07", "2019-07",
"2019-07"),
order_number = c(0, 3, 6, 9, 0, 3, 6, 9),
post = c(138.86, 148.54, 192.52, 233.32, 127.81, 140.16,
121.51, 138.71),
pre = c(163.36, 174.75, 226.5, 283.5, 150.37, 164.83,
142.93, 162.86),
perc_per_group_post = c(1, 0.93, 0.72, 0.6, 1, 0.91, 1.05, 0.92),
perc_per_group_pre = c(1, 0.93, 0.72, 0.58, 1, 0.91, 1.05, 0.92)
)
cohort order_number post pre perc_per_group_post perc_per_group_pre
1 2019-06 0 138.86 163.36 1.00 1.00
2 2019-06 3 148.54 174.75 0.93 0.93
3 2019-06 6 192.52 226.50 0.72 0.72
4 2019-06 9 233.32 283.50 0.60 0.58
5 2019-07 0 127.81 150.37 1.00 1.00
6 2019-07 3 140.16 164.83 0.91 0.91
7 2019-07 6 121.51 142.93 1.05 1.05
8 2019-07 9 138.71 162.86 0.92 0.92
And then reshape it:
df_calc_reshape <- data.frame(stringsAsFactors=FALSE,
cohort = c("2019-06", "2019-06", "2019-06", "2019-06", "2019-07",
"2019-07", "2019-07", "2019-07",
"2019-06", "2019-06", "2019-06", "2019-06",
"2019-07", "2019-07", "2019-07", "2019-07"),
order_number = c(0, 3, 6, 9, 0, 3, 6, 9, 0, 3, 6, 9, 0, 3, 6, 9),
ret_post = c(1, 0.93, 0.72, 0.6, 1, 0.91, 1.05, 0.92, 1, 0.93, 0.72,
0.58, 1, 0.91, 1.05, 0.92),
type = c("perc_per_group_post", "perc_per_group_post",
"perc_per_group_post",
"perc_per_group_post", "perc_per_group_post",
"perc_per_group_post", "perc_per_group_post",
"perc_per_group_post", "perc_per_group_pre",
"perc_per_group_pre", "perc_per_group_pre",
"perc_per_group_pre", "perc_per_group_pre",
"perc_per_group_pre", "perc_per_group_pre",
"perc_per_group_pre")
)
cohort order_number ret_post type
1 2019-06 0 1.00 perc_per_group_post
2 2019-06 3 0.93 perc_per_group_post
3 2019-06 6 0.72 perc_per_group_post
4 2019-06 9 0.60 perc_per_group_post
5 2019-07 0 1.00 perc_per_group_post
6 2019-07 3 0.91 perc_per_group_post
7 2019-07 6 1.05 perc_per_group_post
8 2019-07 9 0.92 perc_per_group_post
9 2019-06 0 1.00 perc_per_group_pre
10 2019-06 3 0.93 perc_per_group_pre
11 2019-06 6 0.72 perc_per_group_pre
12 2019-06 9 0.58 perc_per_group_pre
13 2019-07 0 1.00 perc_per_group_pre
14 2019-07 3 0.91 perc_per_group_pre
15 2019-07 6 1.05 perc_per_group_pre
16 2019-07 9 0.92 perc_per_group_pre
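This should be easy with dplyr, right?
I think I can use mutate, but I don't know how to compute the result per group. For the reshape afterwards I would use gather, but without the first step I'm stuck.
Here is a simple approach that follows your logic: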
library(dplyr)
library(tidyr)
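df %>%
  group_by(cohort) %>%
  # divide the first value of each cohort by every value in that cohort
  mutate_at(vars(c('post', 'pre')), list(new = ~ first(.) / .)) %>%
  # drop the original columns, keeping only post_new and pre_new
  select(-c('post', 'pre')) %>%
  # stack the two ratio columns into long format
  pivot_longer(cols = c('post_new', 'pre_new'),
               names_to = 'type',
               values_to = 'ret_post')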
which gives:
# A tibble: 16 x 4
# Groups:   cohort [2]
   cohort  order_number type     ret_post
   <chr>          <dbl> <chr>       <dbl>
 1 2019-06            0 post_new    1
 2 2019-06            0 pre_new     1
 3 2019-06            3 post_new    0.935
 4 2019-06            3 pre_new     0.935
 5 2019-06            6 post_new    0.721
 6 2019-06            6 pre_new     0.721
 7 2019-06            9 post_new    0.595
 8 2019-06            9 pre_new     0.576
 9 2019-07            0 post_new    1
10 2019-07            0 pre_new     1
11 2019-07            3 post_new    0.912
12 2019-07            3 pre_new     0.912
13 2019-07            6 post_new    1.05
14 2019-07            6 pre_new     1.05
15 2019-07            9 post_new    0.921
16 2019-07            9 pre_new     0.923
You can generalize the long format further through the arguments of pivot_longer(). You can find more information in the tidyr documentation.
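As one possible way to generalize this, the sketch below uses `across()` (which assumes dplyr 1.0.0 or later) so that the new columns are named `perc_per_group_post` and `perc_per_group_pre` as in the question, and selects them in `pivot_longer()` with `starts_with()` instead of listing them by hand. The `arrange()` step and the name `result` are my own additions: `first()` depends on row order, so sorting by order_number makes sure the order 0 row really is first in each cohort.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~cohort,   ~order_number, ~post,  ~pre,
  "2019-06", 0,             138.86, 163.36,
  "2019-06", 3,             148.54, 174.75,
  "2019-06", 6,             192.52, 226.5,
  "2019-06", 9,             233.32, 283.5,
  "2019-07", 0,             127.81, 150.37,
  "2019-07", 3,             140.16, 164.83,
  "2019-07", 6,             121.51, 142.93,
  "2019-07", 9,             138.71, 162.86
)

result <- df %>%
  # make sure order 0 is the first row of each cohort
  arrange(cohort, order_number) %>%
  group_by(cohort) %>%
  # first value of the group divided by each value, for both columns
  mutate(across(c(post, pre), ~ first(.x) / .x,
                .names = "perc_per_group_{.col}")) %>%
  ungroup() %>%
  select(-post, -pre) %>%
  # stack every perc_per_group_* column into long format
  pivot_longer(cols = starts_with("perc_per_group_"),
               names_to = "type",
               values_to = "ret_post")

result
```

If you would rather have `type` contain just `post` and `pre`, adding `names_prefix = "perc_per_group_"` to the `pivot_longer()` call strips that prefix from the names.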