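I have two data frames that hold values for two different years, classified by group. I want to combine them to calculate a running figure for each group while filling in the missing dates. Here are the sample data frames I am working with: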
df1 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)
df2 <- data.frame(
  group = sample(c('g1', 'g2', 'g3', 'g4'), 365, replace = TRUE),
  date = sample(seq(as.Date('2023-01-01'), as.Date('2023-12-31'), by = "day"), 365),
  values = rnorm(365, 10, 1)
)
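This creates two random data frames. Each date gets a random value, but only for one group, so every group is missing dates. Below, I use data.table functions to combine the two data frames, fill in the missing dates, and compute a running mean per group with the runner package. However, I noticed that the dates with a value of 0 are not counted in the mean. For example, this code: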
library(data.table)
library(magrittr)  # for %>%
df <- rbindlist(list(df1, df2))
# One row per group and day, from each group's first date to its last
idx <- df[, .(date = seq(min(date), max(date), "day")), by = group]
setkey(df, group, date)
setkey(idx, group, date)
df <- df[idx] %>%
  setorder(date) %>%
  .[, .(
    date = lubridate::ymd(date),
    values = ifelse(is.na(values), 0, as.numeric(values)),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  ), by = group]
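produces a data frame in which the zero-valued dates are left out of the running mean, whereas this tidyverse code: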
library(dplyr)
library(tidyr)  # for complete() and nesting()
df <- bind_rows(df1, df2) %>%
  complete(
    nesting(group),
    date = seq.Date(min(date), max(date), by = "day")
  ) %>%
  group_by(group) %>%
  arrange(date) %>%
  mutate(
    values = ifelse(is.na(values), 0, as.numeric(values)),
    date = lubridate::ymd(date),
    running_values = runner::mean_run(x = values, k = 90, idx = date)
  )
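gives me the data frame I want.
The two results also differ in the earliest date for each group, although that is not my main concern. How can I reproduce the tidyverse table using data.table functions, so that I don't have to load those dependencies? A simple workaround seems to be converting the NAs to an arbitrarily small number (0.000000001), which appears to work fine, but I would like to understand why the discrepancy occurs.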
Use CJ() to create a filled panel (every group on every day) and right-join it to the data:
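# rbindlist() returns a data.table, which the join syntax below needs
df <- rbindlist(list(df1, df2))
df <- df[
  CJ(group = unique(df$group), date = seq(min(df$date), max(df$date), by = "day")),
  on = .(group, date)
][is.na(values), values := 0]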
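To make the cross join easier to see, here is a minimal sketch on a made-up three-row table. The names `toy`, `panel` and `filled` are hypothetical and only serve the illustration.

```r
library(data.table)

# Toy data: g1 has two of the three days, g2 has only one.
toy <- data.table(
  group  = c("g1", "g1", "g2"),
  date   = as.Date(c("2023-01-01", "2023-01-03", "2023-01-02")),
  values = c(10, 12, 9)
)

# Every group crossed with every day in the overall date range.
panel <- CJ(
  group = unique(toy$group),
  date  = seq(min(toy$date), max(toy$date), by = "day")
)

# toy[panel] keeps every row of panel (a right join);
# panel rows without a match in toy get NA in values.
filled <- toy[panel, on = .(group, date)]

# Replace the NAs by reference with zeros.
filled[is.na(values), values := 0]

print(filled)  # 6 rows: 2 groups x 3 days, values 10, 0, 12, 0, 9, 0
```

Because the date range is taken over the whole table rather than per group, every group starts on the same earliest date, which is also what the `complete()` call in the question does.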
Then compute the running mean:
df[, running_values := runner::mean_run(x = values, k = 90, idx = date), by=.(group)]
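As for why the original data.table code behaved differently, my reading is this. Every expression inside a single `.()` call is evaluated against the original columns, so the `values` passed to `mean_run` still contained the NAs. As far as I know, `mean_run` drops NAs by default through its `na_rm` argument. `mutate()`, by contrast, evaluates its arguments in order, so later expressions see the zero-filled column. The sketch below shows the scoping difference with base `mean(..., na.rm = TRUE)` as a stand-in for `mean_run`. The names `dt`, `one_step` and `two_step` are made up for the example.

```r
library(data.table)

dt <- data.table(group = "g1", values = c(10, NA, 20))

# Inside one .() call, `values` always means the ORIGINAL column, so the
# second expression still sees the NA and na.rm = TRUE silently drops it.
one_step <- dt[, .(
  values = ifelse(is.na(values), 0, values),
  avg    = mean(values, na.rm = TRUE)
), by = group]

# Filling first and computing in a separate call uses the zeros.
two_step <- copy(dt)
two_step[is.na(values), values := 0]
two_step[, avg := mean(values), by = group]

unique(one_step$avg)  # 15: the NA was dropped before averaging
unique(two_step$avg)  # 10: the zero was counted
```

Under that reading, splitting the fill and the running mean into two `:=` steps, as above, is what makes the result match the tidyverse output.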