Calculating a running average with data.table does not count days that have no values?


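I have two data frames containing values for two different years, categorized by group, and I want to combine them to compute a running average for each group while filling in the missing dates. Here are the sample data frames I am working with.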

df1 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

df2 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)

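This creates two random data frames in which every date has a random value, but for only one group, so each group is missing dates. Below, I use data.table functions to combine the two data frames, fill in the missing dates, and compute the running average per group with the runner package. However, I noticed that dates with a value of 0 are not counted in the average. For example, this code: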

df <- rbindlist(list(df1, df2))
idx <- df[, .(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)
df <- df[idx] %>%
  setorder(date) %>%
  .[, .(
    date = lubridate::ymd(date),
    values = ifelse(is.na(values), 0, as.numeric(values)),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  ), by = group]

produces a data frame that looks like this:

group  date        values  running_values
g1     2023-01-04   9.412           9.412
g1     2023-01-05   0               9.412
g1     2023-01-06   0               9.412
g1     2023-01-07   0               9.412
g1     2023-01-08  10.788          10.100
g1     2023-01-09   0              10.100

However, the output I expect comes from the following version, which uses tidyverse functions instead:

df <- bind_rows(df1, df2) %>%
  complete(
    nesting(group),
    date = seq.Date(min(date), max(date), by = "day")
  ) %>%
  group_by(group) %>%
  arrange(date) %>%
  mutate(
    values = ifelse(is.na(values), 0, as.numeric(values)),
    date = lubridate::ymd(date),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  )

This gives the data frame I want:

group  date        values  running_values
g1     2023-01-01   0               0
g1     2023-01-02   0               0
g1     2023-01-03   0               0
g1     2023-01-04   9.412           2.353
g1     2023-01-05   0               1.882
g1     2023-01-06   0               1.568

The tidyverse approach also makes every group start at the earliest date across all groups, whereas the data.table version uses only each group's own earliest date, although that is not my main concern.

How can I reproduce the tidyverse table using data.table functions, so that I don't have to load those dependencies? A simple workaround seems to be converting NA to an arbitrarily small number (0.000000001), which appears to work fine, but I would like to understand why this difference occurs.


r data.table tidyverse runner
1 Answer
Use CJ() to create a filled panel (one row per group per day) and right-join it to the data:

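library(data.table)

# rbindlist() returns a data.table, which the on= join below requires
df <- rbindlist(list(df1, df2))

# Right-join onto the full group x day grid, then zero-fill unmatched days
df <- df[
  CJ(group = unique(df$group), date = seq(min(df$date), max(df$date), by = "day")),
  on = .(group, date)
  ][is.na(values), values := 0]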
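
To see what the cross join does on a small scale, here is a minimal sketch on made-up data. The `toy`, `grid`, and `filled` names and all of the values are hypothetical and exist only for illustration.

```r
library(data.table)

# Toy data (hypothetical): g1 is missing 2023-01-02, g2 has a single day.
toy <- data.table(
  group  = c("g1", "g1", "g2"),
  date   = as.Date(c("2023-01-01", "2023-01-03", "2023-01-02")),
  values = c(10, 12, 9)
)

# CJ() builds every group x day combination, sorted by group and then date.
grid <- CJ(
  group = unique(toy$group),
  date  = seq(min(toy$date), max(toy$date), by = "day")
)

# toy[grid] keeps every row of grid; days without a match get NA values,
# which are then replaced by 0 by reference.
filled <- toy[grid, on = .(group, date)]
filled[is.na(values), values := 0]

print(filled)
```

Because the date range is taken over the whole table, every group starts and ends on the same days, which should also match the tidyverse behaviour described in the question.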

Then compute the running average on df:

df[, running_values := runner::mean_run(x = values, k = 90, idx = date), by=.(group)]
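
As for why the original data.table code behaved differently, one likely explanation (my reading, not something stated above) is that every expression inside `.()` is evaluated against the original columns, so the `running_values` expression still sees the NA-containing `values` rather than the zero-filled one defined on the line before it. dplyr's `mutate()` evaluates its arguments sequentially, so later expressions do see the updated column. I am also assuming that `runner::mean_run()` drops NAs by default (`na_rm = TRUE`), which would explain why the gaps were ignored. The sketch below uses base `mean()` on a hypothetical four-row table to show the scoping effect without depending on runner.

```r
library(data.table)

# Hypothetical data with two missing days.
dt <- data.table(values = c(5, NA, NA, 7))

# Inside .(), `mean_seen` is computed from the ORIGINAL `values` column
# (with NAs), not from the zero-filled `values` defined just above it.
res_list <- dt[, .(
  values    = ifelse(is.na(values), 0, values),
  mean_seen = mean(values, na.rm = TRUE)
)]

# Zero-filling by reference first means the later expression sees the zeros.
dt2 <- copy(dt)
dt2[is.na(values), values := 0]
dt2[, mean_seen := mean(values)]

print(res_list$mean_seen[1])  # 6: the NAs were dropped, zeros never seen
print(dt2$mean_seen[1])       # 3: the zeros are counted
```

This is why filling the NAs with `:=` in a separate step, as in the code above, gives the same averages as the tidyverse version.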
