Computing a rolling weighted average in data.table with frollsum

Question

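The goal is to compute a weighted average by group over a 3-row window, with weights 3, 2, 1 starting from the most recent row. This is similar to the question here, but the weights are not given by a column. I would also really like to use frollsum(), because I am working with a large amount of data and need it to be fast.

I have a solution that uses frollapply():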

library(data.table)

# Your data
set.seed(1)
DT <- data.table(group = rep(c(1, 2), each = 10), value = round(runif(n = 20, 1, 5)))

weights <- 1:3
k <- 3

weighted_average <- function(x) {
  sum(x * weights[1:length(x)]) / sum(weights[1:length(x)])
}

# Apply rolling weighted average
DT[, wtavg := shift(frollapply(value, k, weighted_average, align = "right", fill = NA)), 
   by = group]

DT 
#>     group value    wtavg
#>  1:     1     2       NA
#>  2:     1     2       NA
#>  3:     1     3       NA
#>  4:     1     5 2.500000
#>  5:     1     2 3.833333
#>  6:     1     5 3.166667
#>  7:     1     5 4.000000
#>  8:     1     4 4.500000
#>  9:     1     4 4.500000
#> 10:     1     1 4.166667
#> 11:     2     2       NA
#> 12:     2     2       NA
#> 13:     2     4       NA
#> 14:     2     3 3.000000
#> 15:     2     4 3.166667
#> 16:     2     3 3.666667
#> 17:     2     4 3.333333
#> 18:     2     5 3.666667
#> 19:     2     3 4.333333
#> 20:     2     4 3.833333

Created on 2023-11-27 with reprex v2.0.2

r data.table
1 Answer

Simply calling frollsum three times gives a significant speedup:

shift((frollsum(value, 3) + frollsum(value, 2) + frollsum(value, 1)) / 6)
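
Why this works: a trailing sum over 3 rows, plus one over 2 rows, plus one over 1 row, counts the newest value three times, the middle value twice, and the oldest value once. That is the same 1, 2, 3 weighting (oldest to newest) used by weighted_average in the question, and dividing by 6 (= 1 + 2 + 3) normalizes it. The sketch below checks this on the question's 20-row example against a plain loop. The helper naive_wtavg and the column names wt_naive and wt_froll are made up for this check; they come from neither the question nor data.table.

```r
library(data.table)

# Same example data as in the question
set.seed(1)
DT <- data.table(group = rep(c(1, 2), each = 10),
                 value = round(runif(n = 20, 1, 5)))

# Hypothetical reference helper: weighted average of the three rows
# preceding row i, with weights 1, 2, 3 from oldest to newest.
naive_wtavg <- function(x) {
  out <- rep(NA_real_, length(x))
  if (length(x) > 3) {
    for (i in 4:length(x)) {
      out[i] <- sum(x[(i - 3):(i - 1)] * 1:3) / 6
    }
  }
  out
}

DT[, wt_naive := naive_wtavg(value), by = group]

# Three rolling sums: x[i-2] + 2 * x[i-1] + 3 * x[i], then lag by one row
DT[, wt_froll := shift((frollsum(value, 3) + frollsum(value, 2) + frollsum(value, 1)) / 6),
   by = group]

# Compare the two columns (NA positions included)
all.equal(DT$wt_naive, DT$wt_froll)
```

If the identity holds, all.equal() should report that the two columns match, with the first three rows of each group NA in both.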

Benchmark

set.seed(1)
n = 1000000
groups = 1:1000
DT <- data.table(group = rep(groups, each = n/length(groups)), value = round(runif(n = n, 1, 5)))


bench::mark(
  A = {
    DT[, shift(frollapply(value, k, weighted_average, align = "right", fill = NA)),
       by = group]
  },
  B = {
    DT[, shift((frollsum(value, 3) + frollsum(value, 2) + frollsum(value, 1)) / 6),
       by = group]
  }
)

# # A tibble: 2 × 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
# 1 A             1.68s    1.68s     0.595    46.7MB     29.8     1    50      1.68s
# 2 B          112.11ms 119.48ms     7.67    100.8MB     15.3     4     8   521.21ms
# # ℹ 4 more variables: result <list>, memory <list>, time <list>, gc <list>