There is an example on this blog (no affiliation):
https://www.sumsar.net/blog/pandas-feels-clunky-when-coming-from-r/
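comparing data manipulation (group, filter, summarize) in R (tidyverse) and Python (pandas).
The minimal example is a table with three columns, country, amount and discount, with 2-3 observations/rows per country.
dput output: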
structure(list(country = c("USA", "USA", "USA", "Canada", "Canada",
"Canada", "UK", "UK", "UK", "France", "France", "France", "Germany",
"Germany", "Germany", "Australia", "Australia", "Australia",
"Italy", "Italy", "Italy", "Spain", "Spain", "Spain", "Japan",
"Japan", "Japan", "India", "India", "India", "Brazil", "Brazil"
), amount = c(2000L, 3500L, 3000L, 120L, 180L, 3100L, 130L, 160L,
190L, 110L, 170L, 220L, 140L, 200L, 230L, 150L, 210L, 240L, 160L,
220L, 250L, 170L, 230L, 260L, 180L, 240L, 270L, 190L, 250L, 280L,
200L, 260L), discount = c(10L, 15L, 20L, 12L, 18L, 21L, 13L,
16L, 19L, 11L, 17L, 22L, 14L, 20L, 23L, 15L, 21L, 24L, 16L, 22L,
25L, 17L, 23L, 26L, 18L, 24L, 27L, 19L, 25L, 28L, 20L, 26L)), class = "data.frame", row.names = c(NA,
-32L))
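The idea is to group by country and then keep all rows with amount <= median(amount) * 10, where the median is taken within each country. Then compute total as the sum of amount - discount per country.
With the dplyr API it looks like this: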
purchases |>
group_by(country) |>
filter(amount <= median(amount) * 10) |>
summarize(total = sum(amount - discount))
The order of grouping and filtering is crucial because of two edge cases, USA and Canada. The correct result is:
1 Canada 270
2 USA 8455
If we do not filter at all, the totals are the same for all countries except Canada:
1 Canada 3349
2 USA 8455
If we filter before grouping, the result is correct for all countries except USA:
1 Canada 270
2 USA 1990
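The three variants above can be checked in base R with a small sketch. It rebuilds the data from the dput output; the helper name total_by_country and the intermediate variable names are made up for illustration and are not part of the original post.

```r
# Data rebuilt from the dput output in the question.
purchases <- data.frame(
  country = c(rep(c("USA", "Canada", "UK", "France", "Germany", "Australia",
                    "Italy", "Spain", "Japan", "India"), each = 3),
              "Brazil", "Brazil"),
  amount = c(2000L, 3500L, 3000L, 120L, 180L, 3100L, 130L, 160L, 190L,
             110L, 170L, 220L, 140L, 200L, 230L, 150L, 210L, 240L,
             160L, 220L, 250L, 170L, 230L, 260L, 180L, 240L, 270L,
             190L, 250L, 280L, 200L, 260L),
  discount = c(10L, 15L, 20L, 12L, 18L, 21L, 13L, 16L, 19L, 11L, 17L, 22L,
               14L, 20L, 23L, 15L, 21L, 24L, 16L, 22L, 25L, 17L, 23L, 26L,
               18L, 24L, 27L, 19L, 25L, 28L, 20L, 26L)
)

# Hypothetical helper: sum(amount - discount) per country over the rows
# flagged TRUE in `keep`.
total_by_country <- function(df, keep) {
  kept <- df[keep, ]
  tapply(kept$amount - kept$discount, kept$country, sum)
}

# 1. Group first: each row is compared with 10 x the median of its own country.
keep_grouped <- purchases$amount <=
  10 * ave(purchases$amount, purchases$country, FUN = median)
grouped <- total_by_country(purchases, keep_grouped)

# 2. No filter at all.
unfiltered <- total_by_country(purchases, rep(TRUE, nrow(purchases)))

# 3. Filter first: each row is compared with 10 x the overall median (215).
keep_global <- purchases$amount <= 10 * median(purchases$amount)
global <- total_by_country(purchases, keep_global)

grouped[c("Canada", "USA")]     # Canada 270,  USA 8455
unfiltered[c("Canada", "USA")]  # Canada 3349, USA 8455
global[c("Canada", "USA")]      # Canada 270,  USA 1990
```

Canada's 3100 is an outlier only relative to Canada's own median, and the USA amounts exceed 10 times the overall median (2150) except for the 2000 row, which is why the two countries separate the three variants.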
Using by() from base R I can reproduce the correct result:
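purchases |>
by(
# by() needs the grouping vector itself; a bare country is not looked up inside the data frame
purchases$country,
\(x) sum(x$amount[x$amount <= median(x$amount)*10]-x$discount[x$amount <= median(x$amount)*10])
)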
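But because of its output format, the result is not useful.
Using aggregate() I can only reproduce the variant in which the filter is not country-specific: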
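As a sketch of one possible workaround for the output format, the "by" object can be flattened into a data frame by hand. This assumes the function returns a single number per group; the names res and res_df are made up here, and only the USA and Canada rows of the data are used to keep the example short.

```r
# Subset of the question's data: USA and Canada rows only.
purchases <- data.frame(
  country  = rep(c("USA", "Canada"), each = 3),
  amount   = c(2000L, 3500L, 3000L, 120L, 180L, 3100L),
  discount = c(10L, 15L, 20L, 12L, 18L, 21L)
)

# Per-country total over the rows passing the per-country median filter.
res <- by(purchases, purchases$country, function(x) {
  keep <- x$amount <= median(x$amount) * 10
  sum(x$amount[keep] - x$discount[keep])
})

# A one-dimensional "by" object carries the group labels as names;
# stripping the class and attributes leaves the plain totals.
res_df <- data.frame(
  country = names(res),
  total   = as.vector(unclass(res))
)
res_df
#   country total
# 1  Canada   270
# 2     USA  8455
```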
aggregate(
amount - discount ~ country,
data = purchases,
FUN = sum,
subset = amount <= median(amount)*10
)
i.e. with the wrong total for USA (1990).
And using data.table I am able to get the correct result,
purchases[,.(total = sum(amount[amount <= median(amount)*10] - discount[amount <= median(amount)*10])), country][order(country)]
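but only if I apply the subset/filter/logical index to both columns, amount and discount, which in my opinion should be avoided because it is repetitive and potentially error-prone.
So ultimately my question is:
My data.table is a bit rusty, but I would perhaps do this: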
library(data.table)
setDT(dt)
dt[, cond := amount <= median(amount) * 10, by = country][
(cond), .(total = sum(amount - discount)), by = country
]
# country total
# 1: USA 8455
# 2: Canada 270
# 3: UK 432
# 4: France 450
# 5: Germany 513
# 6: Australia 540
# 7: Italy 567
# 8: Spain 594
# 9: Japan 621
# 10: India 648
# 11: Brazil 414
I can think of one option in data.table that reduces the repetition: use .SD to access the sub-data.table within each country group:
purchases[,
.SD[amount <= median(amount) * 10, .(total = sum(amount - discount))],
by=country
]
# country total
# <char> <int>
# 1: USA 8455
# 2: Canada 270
# 3: UK 432
# 4: France 450
# 5: Germany 513
# 6: Australia 540
# 7: Italy 567
# 8: Spain 594
# 9: Japan 621
#10: India 648
#11: Brazil 414
In base R we can use this:
purchases |>
transform(median = ave(amount, country, FUN = median), total = amount - discount) |>
subset(amount <= 10 * median) |>
aggregate(total ~ country, data = _, FUN = sum)
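The `_` placeholder in the pipeline above needs R 4.2 or later. Below is a sketch of the same three steps with intermediate variables, which also runs on older versions. The column name grp_median and the variable names are renamings chosen for this illustration, and only the USA, Canada and UK rows of the data are used.

```r
# Subset of the question's data: USA, Canada and UK rows only.
purchases <- data.frame(
  country  = rep(c("USA", "Canada", "UK"), each = 3),
  amount   = c(2000L, 3500L, 3000L, 120L, 180L, 3100L, 130L, 160L, 190L),
  discount = c(10L, 15L, 20L, 12L, 18L, 21L, 13L, 16L, 19L)
)

# Step 1: ave() repeats each country's median on every row of that country,
# so the filter condition is computed once and stored as a column.
with_median <- transform(
  purchases,
  grp_median = ave(amount, country, FUN = median),
  total      = amount - discount
)

# Step 2: drop rows above 10 x their own country's median.
kept <- subset(with_median, amount <= 10 * grp_median)

# Step 3: sum the remaining per-row totals by country.
result <- aggregate(total ~ country, data = kept, FUN = sum)
result
#   country total
# 1  Canada   270
# 2      UK   432
# 3     USA  8455
```

Because the condition lives in a helper column, amount and discount are subset together in a single step instead of being indexed separately.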