Note: this is somewhat related to a question I asked here previously.
Here is a subset of my data as an example:
library(dplyr)
DF <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19), day = c("day1", "day2", "day3",
"day4", "day5", "day6", "day6", "day7", "day8", "day9", "day10",
"day10", "day11", "day12", "day13", "day14", "day14", "day14",
"day14"), sent_to = c(NA, NA, "Blue Superstore", "Garden Cinema",
"Pasta House", NA, NA, "Pizzaria", NA, "Ice Palace", NA, NA,
"Shoes Centre", "Dreams Dessert", NA, "Chicken World", "Art Gallery",
"Smoothie Hut", NA), received_from = c("ATM", "Sarah", NA, NA,
NA, "Jane", "Joe", NA, "Sarah", NA, "Anna", "Jane", NA, NA, "Anna",
NA, NA, NA, "Joe"), reference = c("add_cash", "gift", "shopping",
"cinema_tickets", "meal", "reimbursed", "reimbursed", "meal",
"reimbursed", "ice_rink_tickets", "reimbursed", "reimbursed",
"shoes", "ice_cream", "reimbursed", "meal", "gallery_ticket",
"drink", "reimbursed"), decrease = c(0, 0, 15.2, 10.8, 12.5,
0, 0, 10, 0, 18, 0, 0, 15, 6.5, 0, 8, 3.5, 2, 0), increase = c(50,
30, 0, 0, 0, 5.4, 7.25, 0, 10, 0, 6, 6, 0, 0, 21.5, 0, 0, 0,
13.5), reimbursed_id = c(NA, NA, NA, "R", "R", "4", "5", "R",
"8", "R", "10", "10", "R", "R", "13, 14", "R", "R", "R", "16, 17, 18"
), change = c(50, 30, -15.2, -10.8, -12.5, 5.4, 7.25, -10, 10,
-18, 6, 6, -15, -6.5, 21.5, -8, -3.5, -2, 13.5), balance = c(50,
80, 64.8, 54, 41.5, 46.9, 54.15, 44.15, 54.15, 36.15, 42.15,
48.15, 33.15, 26.65, 48.15, 40.15, 36.65, 34.65, 48.15)), row.names = c(NA,
-19L), class = c("tbl_df", "tbl", "data.frame"))
> DF
# A tibble: 19 × 10
id day sent_to received_from reference decrease increase reimbursed_id change balance
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 day1 NA ATM add_cash 0 50 NA 50 50
2 2 day2 NA Sarah gift 0 30 NA 30 80
3 3 day3 Blue Superstore NA shopping 15.2 0 NA -15.2 64.8
4 4 day4 Garden Cinema NA cinema_tickets 10.8 0 R -10.8 54
5 5 day5 Pasta House NA meal 12.5 0 R -12.5 41.5
6 6 day6 NA Jane reimbursed 0 5.4 4 5.4 46.9
7 7 day6 NA Joe reimbursed 0 7.25 5 7.25 54.2
8 8 day7 Pizzaria NA meal 10 0 R -10 44.2
9 9 day8 NA Sarah reimbursed 0 10 8 10 54.2
10 10 day9 Ice Palace NA ice_rink_tickets 18 0 R -18 36.2
11 11 day10 NA Anna reimbursed 0 6 10 6 42.2
12 12 day10 NA Jane reimbursed 0 6 10 6 48.2
13 13 day11 Shoes Centre NA shoes 15 0 R -15 33.2
14 14 day12 Dreams Dessert NA ice_cream 6.5 0 R -6.5 26.6
15 15 day13 NA Anna reimbursed 0 21.5 13, 14 21.5 48.2
16 16 day14 Chicken World NA meal 8 0 R -8 40.2
17 17 day14 Art Gallery NA gallery_ticket 3.5 0 R -3.5 36.6
18 18 day14 Smoothie Hut NA drink 2 0 R -2 34.6
19 19 day14 NA Joe reimbursed 0 13.5 16, 17, 18 13.5 48.2
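Explanation of the reimbursed_id column:
- R means the value in the decrease column does not represent the user's actual spending, because it includes an amount paid on behalf of someone else
- 4 (or any single number) is the id of the transaction for which the user is being reimbursed (a borrowed amount being returned)
- 13, 14 (or any comma-separated list of numbers) are the ids for which the user is being reimbursed, where the reimbursement spans several transactions

Desired result: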
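I want to add an actual_decrease column to this dataset. It should look at the reimbursed_id column, note which ids affect other rows, collect the reimbursed amounts for those rows from the increase column, and subtract them from the decrease value of the corresponding id.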
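More details:
Please refer to the image below (it shows what I would like the actual_decrease column to look like):
As you can see in the screenshot, several different types of calculation are applied to the rows, depending on the contents of the reimbursed_id column.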
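If a row is marked "R", the calculation of actual_decrease depends on whether the reimbursement covers a single transaction (a single id) or several transactions (a comma-separated list of ids).
If a row has no "R" marker, actual_decrease is simply the value in decrease.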
So far I only have the following (based on a question I asked previously):
DF %>%
  left_join(DF %>%
              filter(reference == "reimbursed") %>%
              group_by(id = as.numeric(reimbursed_id)) %>% # removes rows 15 and 19 (they contain comma-separated values)
              summarise(actual_decrease = sum(increase)),
            by = "id") %>%
  mutate(actual_decrease = ifelse(!is.na(actual_decrease),
                                  decrease - actual_decrease,
                                  decrease))
# A tibble: 19 × 11
id day sent_to received_from reference decrease increase reimbursed_id change balance actual_decrease
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 day1 NA ATM add_cash 0 50 NA 50 50 0
2 2 day2 NA Sarah gift 0 30 NA 30 80 0
3 3 day3 Blue Superstore NA shopping 15.2 0 NA -15.2 64.8 15.2
4 4 day4 Garden Cinema NA cinema_tickets 10.8 0 R -10.8 54 5.4
5 5 day5 Pasta House NA meal 12.5 0 R -12.5 41.5 5.25
6 6 day6 NA Jane reimbursed 0 5.4 4 5.4 46.9 0
7 7 day6 NA Joe reimbursed 0 7.25 5 7.25 54.2 0
8 8 day7 Pizzaria NA meal 10 0 R -10 44.2 0
9 9 day8 NA Sarah reimbursed 0 10 8 10 54.2 0
10 10 day9 Ice Palace NA ice_rink_tickets 18 0 R -18 36.2 6
11 11 day10 NA Anna reimbursed 0 6 10 6 42.2 0
12 12 day10 NA Jane reimbursed 0 6 10 6 48.2 0
13 13 day11 Shoes Centre NA shoes 15 0 R -15 33.2 15
14 14 day12 Dreams Dessert NA ice_cream 6.5 0 R -6.5 26.6 6.5
15 15 day13 NA Anna reimbursed 0 21.5 13, 14 21.5 48.2 0
16 16 day14 Chicken World NA meal 8 0 R -8 40.2 8
17 17 day14 Art Gallery NA gallery_ticket 3.5 0 R -3.5 36.6 3.5
18 18 day14 Smoothie Hut NA drink 2 0 R -2 34.6 2
19 19 day14 NA Joe reimbursed 0 13.5 16, 17, 18 13.5 48.2 0
However, this code does not produce the actual_decrease values I want for every type of calculation: the output is incorrect from row 13 onwards.
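To illustrate why rows 15 and 19 drop out: as.numeric() cannot parse a string such as "13, 14" and returns NA (with a warning), so those reimbursements fall into an NA group that matches no id in the join. A minimal sketch using a few reimbursed_id values from the sample data; the names ids, parsed and split_ids are only illustrative:

```r
# A few reimbursed_id values taken from the sample data
ids <- c("4", "10", "13, 14", "16, 17, 18")

# as.numeric() only handles single numbers; multi-id strings become NA
parsed <- suppressWarnings(as.numeric(ids))
is.na(parsed)

# Splitting on the comma first keeps every id (base R, no explicit loop)
split_ids <- lapply(strsplit(ids, ",\\s*"), as.numeric)
lengths(split_ids)
```

So the comma-separated values would presumably need to be split into one id per row before grouping.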
Because my actual dataset is very large, I would rather avoid using loops.
Any help is much appreciated :)
Edit: this is what I would like the dataset to look like:
# A tibble: 19 × 9
id day sent_to received_from reference decrease increase reimbursed_id actual_decrease
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <dbl>
1 1 day1 NA ATM add_cash 0 50 NA 0
2 2 day2 NA Sarah gift 0 30 NA 0
3 3 day3 Blue Superstore NA shopping 15.2 0 NA 15.2
4 4 day4 Garden Cinema NA cinema_tickets 10.8 0 R 5.4
5 5 day5 Pasta House NA meal 12.5 0 R 5.25
6 6 day6 NA Jane reimbursed 0 5.4 4 0
7 7 day6 NA Joe reimbursed 0 7.25 5 0
8 8 day7 Pizzaria NA meal 10 0 R 0
9 9 day8 NA Sarah reimbursed 0 10 8 0
10 10 day9 Ice Palace NA ice_rink_tickets 18 0 R 6
11 11 day10 NA Anna reimbursed 0 6 10 0
12 12 day10 NA Jane reimbursed 0 6 10 0
13 13 day11 Shoes Centre NA shoes 15 0 R 0
14 14 day12 Dreams Dessert NA ice_cream 6.5 0 R 0
15 15 day13 NA Anna reimbursed 0 21.5 13, 14 0
16 16 day14 Chicken World NA meal 8 0 R 0
17 17 day14 Art Gallery NA gallery_ticket 3.5 0 R 0
18 18 day14 Smoothie Hut NA drink 2 0 R 0
19 19 day14 NA Joe reimbursed 0 13.5 16, 17, 18 0
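
For reference, one possible way to arrive at this column without an explicit loop is sketched below. It is not a definitive solution. It assumes that a reimbursement covering several ids is split across them in proportion to each row's decrease; the sample data only shows reimbursements that cover the listed rows in full, so the real allocation rule is not determined by it. The names df, links, refunds and result are illustrative, and only the columns needed for the calculation are reproduced.

```r
library(dplyr)
library(tidyr)

# Reduced copy of the sample data (only the columns used below)
df <- tibble(
  id = as.numeric(1:19),
  decrease = c(0, 0, 15.2, 10.8, 12.5, 0, 0, 10, 0, 18,
               0, 0, 15, 6.5, 0, 8, 3.5, 2, 0),
  increase = c(50, 30, 0, 0, 0, 5.4, 7.25, 0, 10, 0,
               6, 6, 0, 0, 21.5, 0, 0, 0, 13.5),
  reimbursed_id = c(NA, NA, NA, "R", "R", "4", "5", "R", "8", "R",
                    "10", "10", "R", "R", "13, 14", "R", "R", "R",
                    "16, 17, 18")
)

# One row per (reimbursement row, target id) pair
links <- df %>%
  filter(!is.na(reimbursed_id), reimbursed_id != "R") %>%
  select(reimb_row = id, amount = increase, reimbursed_id) %>%
  separate_rows(reimbursed_id, sep = ",") %>%
  mutate(id = as.numeric(trimws(reimbursed_id)))

# ASSUMPTION: each reimbursement is shared among its target rows in
# proportion to their decrease; then all shares are summed per target id
refunds <- links %>%
  left_join(select(df, id, decrease), by = "id") %>%
  group_by(reimb_row) %>%
  mutate(share = amount * decrease / sum(decrease)) %>%
  group_by(id) %>%
  summarise(refunded = sum(share), .groups = "drop")

# Rows that nobody reimbursed keep their original decrease
result <- df %>%
  left_join(refunds, by = "id") %>%
  mutate(actual_decrease = decrease - coalesce(refunded, 0)) %>%
  select(-refunded)
```

On the sample data this gives 5.4, 5.25, 0 and 6 for ids 4, 5, 8 and 10, and 0 (up to floating-point rounding) for ids 13, 14 and 16 to 18. If a group of target rows had a total decrease of 0, the proportional split would divide by zero and would need a guard.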