data.table GForce: applying multiple functions to multiple columns (with optional arguments)


My goal is to apply multiple functions to multiple columns AND have GForce turned on.

Suppose I have the following data.table:

library(data.table)

df <- data.table(fruit = c('a', 'a', 'a', 'b')
                 , revenue = 1:4
                 , profit = c(2,NA,4,5)
                 ); df

   fruit revenue profit
1:     a       1      2
2:     a       2     NA
3:     a       3      4
4:     b       4      5

I want to apply multiple functions to multiple columns (all columns except fruit):

# functions
y <- \(i) {c(min(i, na.rm = T)
             , max(i, na.rm = T)
             )
           }

# apply
df[, lapply(.SD, y)
   , fruit
   , verbose = T
   ]

Finding groups using forderv ... forder.c received 4 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization changed j from 'lapply(.SD, y)' to 'list(y(revenue), y(profit))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... 
  memcpy contiguous groups took 0.000s for 2 groups
  eval(j) took 0.012s for 2 calls
0.020s elapsed (0.020s cpu) 

   fruit revenue profit
1:     a       1      2
2:     a       3      4
3:     b       4      5
4:     b       4      5

The approach above works. Note, however, that the verbose log says (GForce FALSE), so GForce is NOT on.

I think this is because, as Waldi pointed out, GForce is NOT turned on when using \(i) sum(i). I then tried the following, passing na.rm = T only through lapply:

# functions
z <- \(i) {c(min
             , max
              )
           }

# apply
df[, lapply(.SD, z, na.rm = T)
   , fruit
   , verbose = T
   ]

Finding groups using forderv ... forder.c received 4 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization changed j from 'lapply(.SD, z, na.rm = T)' to 'list(z(revenue, na.rm = T), z(profit, na.rm = T))'
GForce is on, left j unchanged
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... Error in z(revenue, na.rm = T) : unused argument (na.rm = T)

This time the call fails with the error shown above. Specifically:

Error in z(revenue, na.rm = T) : unused argument (na.rm = T)

Any help would be greatly appreciated.
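For reference, the error in the second attempt appears to come from base R argument passing rather than from data.table: lapply forwards na.rm = T to z, but z only declares the argument i. In addition, the body c(min, max) returns the two functions themselves instead of calling them. A minimal sketch of a wrapper that accepts extra arguments is below; z_fixed is a hypothetical name used only for illustration. Since it is still a user-defined wrapper like y, it should not be expected to enable GForce.

```r
# z_fixed is a hypothetical name (not from the original post).
# `...` lets the wrapper accept na.rm (or any other argument) and
# forward it to min() and max().
z_fixed <- function(i, ...) {
  c(min(i, ...), max(i, ...))
}

profit <- c(2, NA, 4, 5)
fixed <- z_fixed(profit, na.rm = TRUE)

# The original definition declares only `i`, so passing na.rm fails.
# Its body c(min, max) would also return the functions, not their values.
z <- function(i) c(min, max)
err <- tryCatch(z(profit, na.rm = TRUE),
                error = function(e) conditionMessage(e))
```

Here fixed holds the minimum and maximum of the non-missing values, and err holds the "unused argument" message from the original definition.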

Tags: r, data.table

1 answer

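The only relatively simple suggestion I can offer is not to attempt this in a single df[] call, but to make two separate calls so that the optimization can kick in. For example: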

## bigger data example
df <- data.table(
    fruit = rep(1:2e6, each=2)
  , revenue = 1:4
  , profit = c(2,NA,4,5)
)

rbind(
    df[, lapply(.SD, min, na.rm=TRUE), by=fruit, verbose=TRUE],
    df[, lapply(.SD, max, na.rm=TRUE), by=fruit, verbose=TRUE]
)[order(fruit)]
##Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.008
##Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.002
##0.060s elapsed (0.050s cpu) 

y <- function(i) {
    c(min(i, na.rm = T),
      max(i, na.rm = T))
}

# apply
df[
  , lapply(.SD, y)
  , fruit
  , verbose = T
]
##Making each group and running j (GForce FALSE) ... 
##3.760s elapsed (3.770s cpu) 
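If a single call is still preferred, one possible alternative (an assumption on my part, not something tested in the timings above) is to write each aggregate out explicitly in j, so that every element is a direct call to min or max on a column. GForce is generally able to optimize that form. The output differs from the two-call version: it has one row per group with one column per statistic, and the column names below are illustrative only. Checking the verbose log is the way to confirm whether GForce was actually used on a given data.table version.

```r
library(data.table)

# Small example data from the question
df <- data.table(fruit   = c('a', 'a', 'a', 'b'),
                 revenue = 1:4,
                 profit  = c(2, NA, 4, 5))

# Each element of j is a plain min()/max() call on a column, with no
# user-defined wrapper in between. Column names are illustrative.
res <- df[, .(revenue_min = min(revenue, na.rm = TRUE),
              revenue_max = max(revenue, na.rm = TRUE),
              profit_min  = min(profit,  na.rm = TRUE),
              profit_max  = max(profit,  na.rm = TRUE)),
          by = fruit,
          verbose = TRUE]  # inspect the log for the GForce status
```

The trade-off is that the column list has to be spelled out (or generated) by hand, whereas the two-call version keeps lapply(.SD, ...) and works for any number of columns.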