Rolling mean of a time series with missing dates in R

Question (votes: 0, answers: 3)

R newbie (still) here, working in the tidyverse / RStudio.

I have a tidy dataset in which each row has a date, a grouping feature, and a value (the actual dataset is more complex, but this is the core of it):

I group the data by Date and Group and compute some summary statistics of Value, producing one summary per group for each date. For example:

grouped <- data %>% group_by(Date, Group) %>% summarise(mean = mean(Value))
head(grouped)
# A tibble: 6 × 3
# Groups:   Date [4]
  Date       Group  mean
  <date>     <fct> <dbl>
1 2021-02-18 A      37.4
2 2021-02-19 B      25.5
3 2021-02-19 A      26.1
4 2021-02-22 B      34.2
5 2021-02-22 A      26.4
6 2021-02-23 B      34.2

(Note: the data is included below for reproducibility.)

So far so good. Now I want to get a moving average of these summary statistics (mean in this case, but it could be others) by Group. I tried this with zoo::rollmean:

grouped <- grouped %>% 
    group_by(Group) %>% 
    mutate(rolling = zoo::rollmean(mean, window_length, fill=NA))

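But here a problem arises: ideally the moving average should be taken strictly over a number of days, not a number of records, but one or both groups are missing some days.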

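What is the best way to make sure the moving average correctly accounts for the missing day x group combinations, treating them as NA where needed?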

(I understand from this answer that zoo::rollmean cannot handle NA values, but zoo::rollapply should be able to.)

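I have tried creating a simple calendar data frame containing the full set of dates and joining it to the grouped data, but this makes the Group variable NA as well, so the missing day x group combinations are still ignored by the rollmean / rollapply functions.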

Hope that all makes sense!


data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680, 
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683, 
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684, 
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685, 
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687, 
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688, 
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689, 
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690, 
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691, 
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692, 
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694, 
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695, 
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696, 
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697, 
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698, 
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701, 
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 
18703, 18703, 18703), class = "Date"), Group = structure(c(2L, 
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"), 
    Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93, 
    28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78, 
    25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76, 
    29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6, 
    32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12, 
    16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69, 
    39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61, 
    15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39, 
    19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25, 
    34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21, 
    20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85, 
    17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87, 
    19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35, 
    31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6, 
    41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49, 
    20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32, 
    38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68, 
    14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03, 
    29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42, 
    37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3, 
    37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92, 
    15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74, 
    36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33, 
    16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33, 
    38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44, 
    30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53, 
    23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75, 
    19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
    )), row.names = c(NA, -250L), class = c("tbl_df", "tbl", 
"data.frame"))
r tidyverse zoo rolling-computation
3 Answers
1 vote

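Assuming you want to average over 3 days (the current point and the prior 2 days) rather than 3 rows, and that the dates are already sorted within each group (which is the case in the question), we calculate the number of rows to use (this will be a vector, because each point can have a different number of rows) and pass it to rollapplyr. At each row it averages all rows at or before the current row whose dates fall within the w days ending at the current row's date. This performs the averaging on the original data frame without adding extra NA rows. You can find further examples in the Examples section of ?rollapply.
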
library(dplyr)
library(zoo)

w <- 3 
data %>%
  group_by(Group) %>%
  mutate(Npoints = 1:n() - findInterval(Date - w, Date),
         Mean3 = rollapplyr(Value, Npoints, mean, partial = TRUE, fill = NA)) %>%
  ungroup

giving:

# A tibble: 250 × 5
   Date       Group Value Npoints Mean3
   <date>     <fct> <dbl>   <int> <dbl>
 1 2021-02-18 A      37.4       1  37.4
 2 2021-02-19 A      26.1       2  31.8
 3 2021-02-19 B      25.5       1  25.5
 4 2021-02-22 A      31.6       1  31.6
 5 2021-02-22 A      27.0       2  29.3
 6 2021-02-22 A      15.3       3  24.6
 7 2021-02-22 A      35.9       4  27.5
 8 2021-02-22 A      28.6       5  27.7
 9 2021-02-22 A      17.1       6  25.9
10 2021-02-22 B      30.4       1  30.4
# … with 240 more rows
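
To make the Npoints calculation above easier to follow, here is a minimal base R sketch on made-up toy vectors (the names dates, value, npoints and roll are illustrative assumptions, not part of the answer). It assumes the dates are sorted, as the answer does, and reproduces the same day-based window with a plain loop instead of rollapplyr.

```r
w <- 3

# Toy data (hypothetical): sorted dates with a gap and one duplicated date
dates <- as.Date("2021-02-18") + c(0, 1, 4, 4, 5)
value <- c(10, 20, 30, 40, 50)

# findInterval() counts, for each row, how many dates are <= (date - w),
# i.e. how many earlier rows fall outside the w-day window
n_outside <- findInterval(dates - w, dates)

# Rows at or before the current row that are inside the window
npoints <- seq_along(dates) - n_outside

# Average the last npoints[i] rows ending at row i
roll <- sapply(seq_along(value), function(i) {
  mean(value[(i - npoints[i] + 1):i])
})

print(data.frame(dates, value, npoints, roll))
```

Because the window width is counted in rows but derived from dates, rows that share a date see different windows: each one only looks back at the rows listed before it.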

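If instead you also want to include rows after the current row when they have the same date as the current row, use the following. Here L is the list of offset vectors used by rollapply, such that L[[i]] is the vector of offsets used at row i. Note that, as written, seq(Date[i] - w, Date[i], "day") covers w + 1 calendar days, one more than the w-day window above; start the sequence at Date[i] - w + 1 to match it.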

data %>%
  group_by(Group) %>%
  mutate(L = lapply(1:n(), 
      \(i) which(Date %in% seq(Date[i] - w, Date[i], "day")) - i),
    Mean3 = rollapplyr(Value, L, mean, partial = TRUE, fill = NA)) %>%
  ungroup %>%
  select(-L)

giving:

# A tibble: 250 × 4
   Date       Group Value Mean3
   <date>     <fct> <dbl> <dbl>
 1 2021-02-18 A      37.4  37.4
 2 2021-02-19 A      26.1  31.8
 3 2021-02-19 B      25.5  25.5
 4 2021-02-22 A      31.6  26.4
 5 2021-02-22 A      27.0  26.4
 6 2021-02-22 A      15.3  26.4
 7 2021-02-22 A      35.9  26.4
 8 2021-02-22 A      28.6  26.4
 9 2021-02-22 A      17.1  26.4
10 2021-02-22 B      30.4  32.0
# ℹ 240 more rows
# ℹ Use `print(n = ...)` to see more rows
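
As a rough illustration of what the offset list L contains, the following base R sketch builds the same kind of offsets on made-up toy vectors and averages over them by hand. The names dates, value, offsets and day_mean are assumptions for the example; it uses the same Date[i] - w to Date[i] window as the answer's code and does not call zoo at all.

```r
w <- 3

# Toy data (hypothetical): sorted dates with a gap and one duplicated date
dates <- as.Date("2021-02-18") + c(0, 1, 4, 4, 5)
value <- c(10, 20, 30, 40, 50)

# For row i, find every row whose date lies in [dates[i] - w, dates[i]]
# and express its position relative to i (negative = earlier row,
# positive = later row with the same date)
offsets <- lapply(seq_along(dates), function(i) {
  which(dates >= dates[i] - w & dates <= dates[i]) - i
})

# Averaging over value[i + offsets[[i]]] is what a rolling function
# does when it is handed this list as its width
day_mean <- sapply(seq_along(value), function(i) {
  mean(value[i + offsets[[i]]])
})

print(offsets[[3]])
print(day_mean)
```

Rows 3 and 4 share a date, so their offset vectors point at the same set of rows and they receive the same mean, which matches the repeated values in the output above.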

0 votes

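I wrote a package designed for exactly this kind of problem (timeplyr).

It has a function, time_complete(), that completes the time range of each group at an arbitrary time aggregation. You can then use any rolling-mean function.

See the example below.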

remotes::install_github("NicChr/timeplyr")
library(timeplyr)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
data %>%
  group_by(Group) %>%
  time_complete(time = Date, by = "day") %>%
  mutate(mean = frollmean2(Value, n = 7, na.rm = TRUE))
#> # A tibble: 258 x 4
#> # Groups:   Group [2]
#>    Date       Group Value  mean
#>    <date>     <fct> <dbl> <dbl>
#>  1 2021-02-19 B      25.5  25.5
#>  2 2021-02-20 B      NA    25.5
#>  3 2021-02-21 B      NA    25.5
#>  4 2021-02-22 B      30.4  28.0
#>  5 2021-02-22 B      35.3  30.4
#>  6 2021-02-22 B      36.9  32.0
#>  7 2021-02-23 B      25.5  30.7
#>  8 2021-02-23 B      40.2  33.7
#>  9 2021-02-23 B      38.9  34.5
#> 10 2021-02-23 B      36.8  34.9
#> # ... with 248 more rows
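
For comparison, a similar completion step can probably be sketched without timeplyr by using tidyr::complete() on a per-day summary. This is only an illustration under assumptions: the toy tibble daily and its column daily_mean are made up, the window is 3 days, and NA days are skipped with na.rm = TRUE rather than propagated.

```r
library(dplyr)
library(tidyr)
library(zoo)

# Toy per-day summary (hypothetical values): both groups have missing days
daily <- tibble(
  Date = as.Date("2021-02-18") + c(0, 1, 1, 4, 4),
  Group = factor(c("A", "A", "B", "A", "B")),
  daily_mean = c(37.4, 26.1, 25.5, 26.4, 34.2)
)

filled <- daily %>%
  # One row per Date x Group combination; new rows get daily_mean = NA
  complete(Date = seq(min(Date), max(Date), by = "day"), Group) %>%
  arrange(Group, Date) %>%
  group_by(Group) %>%
  # With exactly one row per day, a 3-row window is a 3-day window
  mutate(rolling = rollapplyr(daily_mean, 3,
                              function(x) mean(x, na.rm = TRUE),
                              partial = TRUE)) %>%
  ungroup()

print(filled)
```

Summarising to one row per day before completing is what makes the row-based window line up with calendar days; on the raw data, with several rows per date, a fixed row width would still count records rather than days.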

Created on 2023-03-29 with reprex v2.0.2


-1 votes
library(dplyr)
library(zoo)

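# Create a calendar data frame with every Date x Group combination,
# so that missing days are filled per group rather than with Group = NA
calendar <- expand.grid(
  Date = seq(min(data$Date), max(data$Date), by = "day"),
  Group = unique(data$Group)
)

# Join data and calendar by the "Date" and "Group" columns
data_full <- full_join(data, calendar, by = c("Date", "Group"))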

# Group the data by date and group and calculate the summary statistics of the value
grouped <- data_full %>% 
  group_by(Date, Group) %>% 
  summarise(mean = mean(Value)) 

# Group the resulting summary statistics data by group
grouped_by_group <- grouped %>% 
  group_by(Group) 

# Use rollapply() to calculate the moving average for each group separately
window_length <- 7  # the desired number of days for the moving average window
grouped_by_group <- grouped_by_group %>% 
  mutate(rolling = rollapply(mean, window_length, mean, fill = NA, align = "right"))