data.table with a custom function

Question

I am new to data.table, coming from dplyr. I have the following custom function, tabs:

tabs <- function(dt, x) {
tab2 <- dt[!is.na(x), ][, .(Freq = sum(nwgt0)), by = .(inc_cat, year, x)][, Prop := Freq / sum(Freq), by= .(inc_cat, year)][order(inc_cat, year)][x == 1 & !is.na(inc_cat), ] %>%
   ggplot(., aes(x= year, y = Prop, color = factor(inc_cat, levels = c(1,2,3,4),labels = c("0% to 100% FPL", "101-138% FPL", "139-200% FPL", ">200% FPL")))) +
    labs(color = "Income Categories") +
    geom_line() +
    theme_minimal() +
  ylab("Weighted proportion") +
   theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  )
return(tab2)
}

I now want to call the function tabs.

I tried the following, which does not work:

result <- hints_dt[ , tabs(.SD, x='internet_use')]

and I get the following error:

Error in `[.data.table`(dt[!is.na(x), ], , .(Freq = sum(nwgt0)), by = .(inc_cat,  : 
  The items in the 'by' or 'keyby' list are length(s) (22344,22344,1). Each must be length 22344; the same length as there are rows in x (after subsetting if i is provided).

Should I use .SDcols to specify the internet_use column? If so, how should I modify my function?

Thanks,

Felipe

Edit: following the comments below, I am adding a reprex here, using data from NHANES:

data("nhanes")

I adapted the function tabs:

tabs <- function(dt, x) {
tab2 <- dt[!is.na(x), ][, .(Freq = sum(WTMEC2YR)), by = .(race, agecat, x)][, Prop := Freq / sum(Freq), by= .(race, agecat)][order(race, agecat)][x == 1 & !is.na(race), ] %>%
   ggplot(., aes(x= year, y = Prop, color = factor(race, levels = c(1,2,3,4),labels = c("hispanic", "white", "black", "other")))) +
    labs(color = "Race") +
    geom_line() +
    theme_minimal() +
  ylab("Weighted proportion") +
   theme(
  panel.border = element_blank(),
  panel.grid.major = element_blank(),
  panel.grid.minor = element_blank(),
  )
return(tab2)
}

When I run

result <- nhanes[ , tabs(.SD, x="RIAGENDR")]

I can reproduce my error:

Error in `[.data.table`(dt[!is.na(get(x)), ], , .(Freq = sum(WTMEC2YR)),  : 
  The items in the 'by' or 'keyby' list are length(s) (8591,8591,1). Each must be length 8591; the same length as there are rows in x (after subsetting if i is provided).

Answer 1

get(x) works on the LHS/RHS of the data.table::`:=` operator, and likewise in i:

MT <- as.data.table(mtcars)
fun <- function(DT, v) DT[!(get(v) == 4),]
fun(MT, "cyl") # WORKS

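but your use of non-standard evaluation (NSE) inside by= will not work: there, x is just a length-1 character string, not the column it names.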

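Note: for the sake of argument, I mimic your code by hard-coding part of the by grouping inside the function. That is usually fine if the function will only ever be used with one specific dataset, but if you try to generalize it, you should never assume that particular fields exist in more general calls on other data.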

fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = .(gear, by)][]
fun2(MT, v="disp", by="cyl")
# Error in `[.data.table`(DT, , lapply(.SD, sum), .SDcols = v, by = .(gear,  : 
#   The items in the 'by' or 'keyby' list are length(s) (32,1). Each must be length 32; the same length as there are rows in x (after subsetting if i is provided).
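
The lengths (32,1) in that message can be reproduced outside of `[.data.table`. This is a minimal sketch, assuming the grouping name is held in a variable called grp (a hypothetical name, not taken from the question). Inside .( ), which is an alias for list( ), gear resolves to the 32-row column, while grp is only the one-element string "cyl".

```r
library(data.table)

MT <- as.data.table(mtcars)
grp <- "cyl"  # hypothetical variable holding a column name as a string

# Evaluate list(gear, grp) with MT's columns in scope, roughly as by = .(gear, grp) would:
# gear is found as a column, grp is found in the calling environment as a string
item_lengths <- lengths(with(MT, list(gear, grp)))
print(item_lengths)  # one full-length item and one length-1 item

# Looking the string up as a column name gives the full-length vector instead
col_length <- length(MT[[grp]])
print(col_length)
```

This is why either get() or a character vector is needed: something has to turn the string into a column lookup.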

We can also use get(by) inside the NSE by=,

fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = .(gear, get(by))][]
fun2(MT, v="disp", by="cyl") # works

but that may not always work. In these cases I find it best to remember that by= can be either NSE, as you are using, or a character vector.

fun2 <- function(DT, v, by) DT[, lapply(.SD, sum), .SDcols = v, by = c("gear", by)][]
fun2(MT, v="disp", by="cyl") # works

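Use by=c(..) instead of by=.(..). The same character-vector approach also works for non-equi joins, where data.table parses and evaluates the strings internally, for example on=c("gear", paste(v, ">", otherv)) (assuming we have another variable otherv for the join comparison).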
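
As a hedged sketch of the non-equi join case: data.table joins take their conditions through on=, which accepts strings such as "lo<val". The tables A and B and all of their column names are invented for illustration, and v and otherv simply hold column names as strings.

```r
library(data.table)

# Hypothetical toy tables; the column names are illustrative only
A <- data.table(id = c(1L, 1L, 2L), lo = c(1, 5, 2))
B <- data.table(id = c(1L, 2L), val = c(3, 1))

v <- "lo"        # column of A
otherv <- "val"  # column of B

# Build the join conditions as strings: equality on id, inequality lo < val
cond <- c("id", paste0(v, "<", otherv))

# nomatch = NULL drops rows of B that have no matching row in A
joined <- A[B, on = cond, nomatch = NULL]
print(joined)
```

Only the A row with id 1 and lo 1 satisfies both conditions here, so the result has a single row.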

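From there, whatever else you do in the rest of the function should follow the same approach: treat the column-name arguments as character vectors.

Note that I set this function up so that such an argument can have length 1 or more.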
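
Applied to the aggregation step of tabs, the character-vector approach might look like the sketch below. The helper name tab_props and its argument names value_col, group_cols and weight_col are hypothetical, and mtcars stands in for the survey data, with its wt column used purely as a stand-in for a survey weight.

```r
library(data.table)

# Hypothetical helper: weighted proportions of value_col within each group.
# Argument names are chosen so they do not collide with column names,
# because get() inside [.data.table would otherwise see the column first.
tab_props <- function(dt, value_col, group_cols, weight_col) {
  stopifnot(
    is.character(value_col), length(value_col) == 1L,
    is.character(weight_col), length(weight_col) == 1L,
    all(c(value_col, group_cols, weight_col) %in% names(dt))
  )
  # Sum the weights per group and per level of value_col, skipping NA values
  out <- dt[!is.na(get(value_col)),
            .(Freq = sum(get(weight_col))),
            by = c(group_cols, value_col)]
  # Share of each level within its group
  out[, Prop := Freq / sum(Freq), by = group_cols]
  setorderv(out, c(group_cols, value_col))
  out[]
}

MT <- as.data.table(mtcars)
res <- tab_props(MT, value_col = "am", group_cols = c("gear", "cyl"), weight_col = "wt")
print(res)
```

The filtering on the level of interest (x == 1 in the question) and the plotting can then be applied to the returned table, keeping the aggregation reusable for any column name.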

To pass enough variables through for the ggplot(.) expression to work, a few other techniques can make the function more defensive.

fun3 <- function(DT, v, x = "year", y = "Prop") {
  stopifnot(all(c(v, x, y) %in% names(DT)))
  library(ggplot2)
  DT[!is.na(get(v)) & get(v) > 4,] |>
    ggplot(aes(x = .data[[x]], y = .data[[y]])) +
    geom_point() +
    theme_minimal()
}
fun3(MT, "cyl", x="mpg", y="disp")
