Optimizing R code with vectorized operations

Question

Here is a piece of R code that currently works well, where x is "numeric", unit is "character", and digits is "numeric":

process <- function(x, unit, digits) {
  if (unit == "$") {
    x <- comprss(x)
  } else {
    x <- reformat(x, digits)
  }
}

x <- mapply(process, x, unit, digits)

I use mapply because I want to apply the function to the vector x. x is a column of a data.table, and so are unit and digits; all three have the same length.

I could vectorize the functions comprss and reformat and apply them using a logical condition vector over x, but here is my problem:

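The functions comprss and reformat both take an integer as their argument and return a character. So with vectorized functions, once I apply the first function the class of x changes from "numeric" to "character", and the second function can no longer be applied. That is why I use mapply instead of vectorized functions.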

However, this approach does not take advantage of vectorized operations and is quite slow.

Answer 1

Use dplyr; it is much faster than your approach (although it is hard to guess what your functions actually do):

library(tidyverse)
library(stringi)
library(rbenchmark)

comprss <- function(x) {
  paste(x)
}

reformat <- function(x, digits){
  format(x,nsmall = digits)
}

process <- function(x, unit, digits) { 
  if (unit == "$") {
    x <- comprss(x)
  } else {
    x <- reformat(x,digits)
  }
} 

x <- runif(100000,1,20000)
unit <- stri_rand_strings(100000,1,"[$€]")
digits <- floor(runif(100000,1,10))
df <- data.frame(x,
                 unit,
                 digits)

benchmark("dplyr" = {
  y <- df %>% mutate(y = if_else(unit == "$", comprss(x), reformat(x, digits))) %>% pull(y)
},
"question" = {
  y <- mapply(process, x, unit, digits)
},
replications = 5)

      test replications elapsed relative user.self sys.self user.child sys.child
1    dplyr            5    0.86    1.000      0.83     0.00         NA        NA
2 question            5    5.94    6.907      5.67     0.08         NA        NA

This is your process function written as a dplyr if_else:

y <- df %>% mutate(y = if_else(unit == "$", comprss(x), reformat(x, digits))) %>% pull(y)
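
Note that if_else evaluates both comprss(x) and reformat(x, digits) over the whole vector before picking values. A base R alternative is to fill a separate character vector by logical index, which also sidesteps the class change described in the question because x itself is never overwritten. The following is only a minimal sketch: the comprss and reformat bodies are hypothetical stand-ins (the real ones are not shown in the question), process_vec is a name made up here, and reformat is assumed to be vectorized over x while taking a single digits value.

```r
# Hypothetical stand-ins for the asker's functions (assumptions, not the real ones)
comprss  <- function(x) paste0("$", round(x))
reformat <- function(x, digits) formatC(x, format = "f", digits = digits)

# process_vec is an illustrative name, not part of the original code
process_vec <- function(x, unit, digits) {
  out <- character(length(x))        # result is character from the start
  is_dollar <- unit == "$"

  # One vectorized call for all "$" rows
  out[is_dollar] <- comprss(x[is_dollar])

  # reformat() is assumed to take a scalar digits value, so the remaining
  # rows are handled once per distinct digits value, not once per element
  for (d in unique(digits[!is_dollar])) {
    idx <- !is_dollar & digits == d
    out[idx] <- reformat(x[idx], d)
  }
  out
}

x      <- c(1234.4, 3.14159, 2.5, 10)
unit   <- c("$", "€", "€", "$")
digits <- c(2, 2, 3, 1)

res <- process_vec(x, unit, digits)
res
```

The loop runs once per distinct value of digits (at most nine iterations for the benchmark data above), so the cost is a handful of vectorized calls instead of one function call per element.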

Answer 2

Since you asked about data.table: yes, changing the class of a column by group does not work well:

library(data.table)
MT <- as.data.table(head(mtcars))
MT[, disp := paste("qq", disp), by = cyl]
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   Coercing 'character' RHS to 'double' to match the type of the target column (column 0 named '').
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   NAs introduced by coercion
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   Coercing 'character' RHS to 'double' to match the type of the target column (column 0 named '').
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   NAs introduced by coercion
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   Coercing 'character' RHS to 'double' to match the type of the target column (column 0 named '').
# Warning in `[.data.table`(MT, , `:=`(disp, paste("qq", disp)), by = cyl) :
#   NAs introduced by coercion

But you can get there by assigning to a new (throwaway) column, or to one that is already a string, and then reassigning:

MT[, disp2 := paste("qq", disp), by = cyl][, disp := disp2][, disp2 := NULL][]
#      mpg   cyl   disp    hp  drat    wt  qsec    vs    am  gear  carb
#    <num> <num> <char> <num> <num> <num> <num> <num> <num> <num> <num>
# 1:  21.0     6 qq 160   110  3.90 2.620 16.46     0     1     4     4
# 2:  21.0     6 qq 160   110  3.90 2.875 17.02     0     1     4     4
# 3:  22.8     4 qq 108    93  3.85 2.320 18.61     1     1     4     1
# 4:  21.4     6 qq 258   110  3.08 3.215 19.44     1     0     3     1
# 5:  18.7     8 qq 360   175  3.15 3.440 17.02     0     0     3     2
# 6:  18.1     6 qq 225   105  2.76 3.460 20.22     1     0     3     1

In your case, I think that would be

DT[, x2 := mapply(process, x, unit, digits), by = yourgroup
  ][, x := x2][, x2 := NULL]
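
The same temporary-column idea can also be combined with vectorized calls in place of mapply. The sketch below is hedged in the same way as the rest of this answer: comprss and reformat are hypothetical stand-ins for your functions, x_chr is a made-up column name, and reformat is assumed to accept a vector x with a single digits value, which is what by = digits supplies inside each group.

```r
library(data.table)

# Hypothetical stand-ins for the asker's functions (assumptions, not the real ones)
comprss  <- function(x) paste0("$", round(x))
reformat <- function(x, digits) formatC(x, format = "f", digits = digits)

DT <- data.table(
  x      = c(1234.4, 3.14159, 2.5, 10),
  unit   = c("$", "€", "€", "$"),
  digits = c(2, 2, 3, 1)
)

# Rows with unit "$": one vectorized call, written to a new character column
DT[unit == "$", x_chr := comprss(x)]

# Remaining rows: grouped by digits, so `digits` is a single value within j
DT[unit != "$", x_chr := reformat(x, digits), by = digits]

# Replace the numeric column with the character one, then drop the temporary
DT[, x := x_chr][, x_chr := NULL]
DT
```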
