Appending rows to a data.table (passed as an argument to a function)

Question (votes: 1, answers: 2)

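I pass some data.tables to a function and want to collect a growing set of results in those data.tables across multiple function calls. The rows are added (appended) inside the function.

Is there a way to append rows to a data.table "by reference", i.e. in place?

Is there a workaround if this is not possible?

Edit: My goal is to add multiple rows at once inside the function, and the number of rows and columns can be very large (that is why I am using data.table).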

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # How to append "rule" to "checked.rules"?

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a simple example
  # How to append all findings to "valid.result"?
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

Expected result:

checked.rules
#    rule.id rule.name
# 1:       3  too late
# 2:       1   too big
# 3:       4     empty

valid.result
#    err.code            msg
# 1:        3 some blah blah
# 2:        1 some blah blah
# 3:        4 some blah blah
Tags: r, data.table
2 answers
1
vote

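As the link provided by @Henrik already mentions, rows currently cannot be added to a data.table by reference. I would therefore go with rbindlist (which also works well for adding multiple rows at once):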

library(data.table)

validate <- function(data, rule, valid.result, checked.rules) {
  # ... find errors

  # Append "rule" to "checked.rules" (`<<-` assigns in the enclosing, here global, environment)
  checked.rules <<- rbindlist(list(checked.rules, rule))

  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")  # just a simple example
  # Append all findings to "valid.result"
  valid.result <<- rbindlist(list(valid.result, findings))
}

data          <- data.table(a=1:10, b=21:30)
valid.result  <- data.table(err.code = integer(0), msg       = character(0))  # empty validation results table
checked.rules <- data.table(rule.id  = integer(0), rule.name = character(0))  # empty table
rules         <- data.table(rule.id  = 1:4,        rule.name = c("too big", "too small", "too late", "empty"))

validate(data, rules[3, ], valid.result, checked.rules)
validate(data, rules[1, ], valid.result, checked.rules)
validate(data, rules[4, ], valid.result, checked.rules)

print(checked.rules)
print(valid.result)
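
Note that `<<-` updates the right tables here only because they live in the enclosing (global) environment under the same names as the function parameters. The following is a minimal sketch of an alternative, assuming it is acceptable to change what the function returns. The names `validate2` and `state` are illustrative and not part of the original code. The function returns the updated tables and the caller reassigns them:

```r
library(data.table)

# Hypothetical variant: returns the updated tables instead of using `<<-`
validate2 <- function(data, rule, valid.result, checked.rules) {
  # ... find errors
  findings <- data.table(err.code = rule$rule.id, msg = "some blah blah")
  list(
    valid.result  = rbindlist(list(valid.result, findings)),
    checked.rules = rbindlist(list(checked.rules, rule))
  )
}

data  <- data.table(a = 1:10, b = 21:30)
rules <- data.table(rule.id = 1:4, rule.name = c("too big", "too small", "too late", "empty"))

# The accumulated tables are kept in one list that is reassigned after every call
state <- list(
  valid.result  = data.table(err.code = integer(0), msg = character(0)),
  checked.rules = data.table(rule.id = integer(0), rule.name = character(0))
)

for (i in c(3, 1, 4)) {
  state <- validate2(data, rules[i, ], state$valid.result, state$checked.rules)
}

print(state$checked.rules)
print(state$valid.result)
```

This still creates a new table on every call, so it changes where the result is stored, not how often rows are copied.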

1
vote

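After reading the links in the comments and @ismirsehregal's suggestion to use a list, I ended up using an environment so that I can collect multiple results "by reference".

I benchmarked two variants:

  1. rbind the intermediate result into the "accumulated" result at the end of each function call ("append inside the function").
  2. Collect the intermediate result of each function call and call rbindlist only once at the end ("append outside the function").

The benchmark code below is simplified and produces about 9 million rows after 20 function calls: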

library(data.table)
library(microbenchmark)

validate.rbind <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a simple example
  results$valid.result <- rbind(results$valid.result, findings) # same as: rbindlist(list(results$valid.result, findings))
}

validate.rbindlist <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(1E6, 1) + 1))  # just a simple example
  # store the intermediate result under a numbered name (res01, res02, ...) in the environment
  assign(paste0("res", sprintf("%02d", results$counter)), findings, envir = results)
  results$counter <- results$counter + 1
}

microbenchmark(
  rbind.per.call = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$valid.result <- data.table(err.code = integer(0), msg = character(0))  # empty validation results table
    for (i in 1:20) {
      validate.rbind(data, results)
    }
  },
  rbindlist.once = {
    set.seed(0815)   # make random numbers reproducible
    data                 <- data.table(a=1:100, b=21:30)
    results              <- new.env()   # use an environment to pass arguments by reference
    results$counter      <- 1
    for (i in 1:20) {
      validate.rbindlist(data, results)
    }
    result.vars <- ls(envir = results, pattern = "^res.*")  # identify the result tables via the used naming pattern
    results$valid.result <- rbindlist(mget(result.vars, envir = results))
    rm(list = result.vars, envir = results)  # remove the intermediate result tables (keep only the total result)
  },
  times = 10)

Variant 2 is about four times faster:

Unit: milliseconds
           expr       min        lq      mean    median        uq       max neval
 rbind.per.call 1021.2956 1114.8187 1198.7033 1153.7775 1324.6672 1477.5669    10
 rbindlist.once  231.0477  249.7195  305.0974  260.2499  275.3446  713.1155    10

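The memory footprint (observed with gc()) is better as well: the peak Vcells usage ("max used") is about 298 Mb instead of about 399 Mb: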

# Memory consumption for rbind.per.call:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   510152  27.3     940480  50.3   847768  45.3
# Vcells 19636460 149.9   55027624 419.9 52254173 398.7

# Memory consumption for rbindlist.once:
#            used (Mb)  gc trigger  (Mb) max used  (Mb)
# Ncells   604335  32.3    1168576  62.5   940480  50.3
# Vcells 19859703 151.6   55503896 423.5 39082073 298.2

PS: I have not tested the linked set variant, because I do not expect better performance from it and it is more complicated to use.
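
As a variation on variant 2, the intermediate tables can also be kept in a plain list inside the environment instead of in individually named variables, which avoids the `assign`/`ls`/`mget` bookkeeping. This is only a sketch and was not part of the benchmark above. The names `validate.list` and `results$parts` are assumptions made for illustration, and the row counts are kept small:

```r
library(data.table)

# Hypothetical helper: appends the findings of one call to a list stored in `results`
validate.list <- function(data, results) {
  findings <- data.table(err.code = 100, msg = rep("some blah blah", sample(10, 1) + 1))
  results$parts[[length(results$parts) + 1L]] <- findings
  invisible(NULL)
}

set.seed(815)   # make random numbers reproducible
data          <- data.table(a = 1:10, b = 21:30)
results       <- new.env()   # environments are passed by reference
results$parts <- list()      # collects one data.table per function call

for (i in 1:20) {
  validate.list(data, results)
}

# Bind all intermediate tables only once at the end
results$valid.result <- rbindlist(results$parts)
rm("parts", envir = results)   # keep only the total result

print(nrow(results$valid.result))
```

Because the list is created up front, no counter and no naming pattern are needed to find the intermediate tables again.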
