R 菜鸟(仍然)在这里,在
tidyverse
/ RStudio 工作。
我有一个整洁的数据集,其中每一行都有一个日期、一个分组特征和一个值(实际数据集更复杂,但这是它的核心):
我为每个
Group
按Date
对数据进行分组,并计算Value
的一些汇总统计数据,为每个日期生成一个按组汇总。例如:
grouped <- data %>% group_by(Date, Group) %>% summarise(mean = mean(Value))
head(grouped)
# A tibble: 6 × 3
# Groups: Date [4]
Date Group mean
<date> <fct> <dbl>
1 2021-02-18 A 37.4
2 2021-02-19 B 25.5
3 2021-02-19 A 26.1
4 2021-02-22 B 34.2
5 2021-02-22 A 26.4
6 2021-02-23 B 34.2
(注意:为了再现性,数据如下。)
到目前为止一切顺利。现在我想通过
mean
获取这些摘要统计数据的移动平均值(在这种情况下为Group
,但也可以是其他数据)。我用zoo::rollmean
试过这个:
grouped <- grouped %>%
group_by(Group) %>%
mutate(rolling = zoo::rollmean(mean, window_length, fill=NA))
但是这里出现了一个问题——理想情况下,移动平均应该严格地是一些days,而不是records,但是一个或两个组都缺少一些天数。
确保移动平均线正确考虑缺失天数 x 组的最佳方法是什么,根据需要将它们视为
NA
?
(我从this answer了解到
zoo::rollmean
无法处理NA
值,但zoo::rollapply
应该能够。)
我已经尝试创建一个简单的日历数据框,其中包含完整的日期集到
join
分组数据,但是这使得Group
变量也为NA
,所以缺失的天数x组仍然被忽略rollmean / rollapply
功能。
希望一切都有意义!
data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680,
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680,
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683,
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683,
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684,
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685,
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687,
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688,
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689,
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690,
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691,
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692,
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694,
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694,
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695,
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696,
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697,
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698,
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701,
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701,
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703), class = "Date"), Group = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"),
Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93,
28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78,
25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76,
29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6,
32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12,
16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69,
39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61,
15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39,
19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25,
34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21,
20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85,
17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87,
19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35,
31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6,
41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49,
20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32,
38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68,
14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03,
29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42,
37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3,
37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92,
15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74,
36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33,
16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33,
38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44,
30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53,
23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75,
19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
)), row.names = c(NA, -250L), class = c("tbl_df", "tbl",
"data.frame"))
假设平均 3 天(当前点和前 2 天)而不是 3 行,并且日期已经在组内排序(问题中就是这种情况),我们计算要使用的行数(这将是一个向量因为每个点可以有不同的行数)并在
rollapplyr
中使用它。在每一行,它对当前行之前的所有行或当前行的所有行进行平均,这些行在当前行之前的 w 天内。这在不添加额外的 NA 行的情况下对原始数据帧执行平均。您可以在 ?rollapply
. 的示例部分中找到其他示例
library(dplyr)
library(zoo)
w <- 3
data %>%
group_by(Group) %>%
mutate(Npoints = 1:n() - findInterval(Date - w, Date),
Mean3 = rollapplyr(Value, Npoints, mean, partial = TRUE, fill = NA)) %>%
ungroup
给予:
# A tibble: 250 × 5
Date Group Value Npoints Mean3
<date> <fct> <dbl> <int> <dbl>
1 2021-02-18 A 37.4 1 37.4
2 2021-02-19 A 26.1 2 31.8
3 2021-02-19 B 25.5 1 25.5
4 2021-02-22 A 31.6 1 31.6
5 2021-02-22 A 27.0 2 29.3
6 2021-02-22 A 15.3 3 24.6
7 2021-02-22 A 35.9 4 27.5
8 2021-02-22 A 28.6 5 27.7
9 2021-02-22 A 17.1 6 25.9
10 2021-02-22 B 30.4 1 30.4
# … with 240 more rows
相反,如果您想包括当前行之前的行,如果它们等于当前行的日期,那么使用它。这里 L 是
rollapply
使用的偏移向量列表,使得 L[[i]]
是在第 i 行使用的偏移向量。
data %>%
group_by(Group) %>%
mutate(L = lapply(1:n(),
\(i) which(Date %in% seq(Date[i] - w, Date[i], "day")) - i),
Mean3 = rollapplyr(Value, L, mean, partial = TRUE, fill = NA)) %>%
ungroup %>%
select(-L)
给予:
# A tibble: 250 × 4
Date Group Value Mean3
<date> <fct> <dbl> <dbl>
1 2021-02-18 A 37.4 37.4
2 2021-02-19 A 26.1 31.8
3 2021-02-19 B 25.5 25.5
4 2021-02-22 A 31.6 26.4
5 2021-02-22 A 27.0 26.4
6 2021-02-22 A 15.3 26.4
7 2021-02-22 A 35.9 26.4
8 2021-02-22 A 28.6 26.4
9 2021-02-22 A 17.1 26.4
10 2021-02-22 B 30.4 32.0
# ℹ 240 more rows
# ℹ Use `print(n = ...)` to see more rows
我写了一个专为这类问题设计的包(timeplyr)。
里面有一个函数
time_complete()
,通过任意时间聚合完成每个组的时间范围。然后,您可以使用任何滚动平均函数。
请参阅下面的示例。
data <- structure(list(Date = structure(c(18676, 18677, 18677, 18680,
18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680, 18680,
18680, 18680, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681, 18681,
18681, 18681, 18681, 18682, 18682, 18682, 18682, 18682, 18683,
18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683, 18683,
18683, 18683, 18683, 18684, 18684, 18684, 18684, 18684, 18684,
18684, 18684, 18684, 18684, 18684, 18685, 18685, 18685, 18685,
18685, 18685, 18685, 18685, 18685, 18685, 18685, 18687, 18687,
18687, 18687, 18687, 18687, 18687, 18687, 18687, 18688, 18688,
18688, 18688, 18688, 18688, 18688, 18688, 18688, 18689, 18689,
18689, 18689, 18689, 18689, 18690, 18690, 18690, 18690, 18690,
18690, 18690, 18690, 18691, 18691, 18691, 18691, 18691, 18691,
18691, 18691, 18691, 18691, 18692, 18692, 18692, 18692, 18692,
18692, 18692, 18692, 18692, 18692, 18692, 18692, 18693, 18694,
18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694, 18694,
18694, 18694, 18695, 18695, 18695, 18695, 18695, 18695, 18695,
18695, 18695, 18696, 18696, 18696, 18696, 18696, 18696, 18696,
18696, 18696, 18697, 18697, 18697, 18697, 18697, 18697, 18697,
18697, 18697, 18698, 18698, 18698, 18698, 18698, 18698, 18698,
18698, 18698, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699, 18699,
18699, 18699, 18699, 18700, 18701, 18701, 18701, 18701, 18701,
18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701, 18701,
18701, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702, 18702,
18702, 18702, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703, 18703,
18703, 18703, 18703), class = "Date"), Group = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L,
1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), levels = c("B", "A"), class = "factor"),
Value = c(37.43, 26.13, 25.54, 31.65, 26.95, 15.29, 35.93,
28.59, 17.14, 30.42, 20.52, 33.4, 35.3, 36.87, 28.32, 21.78,
25.49, 34.13, 20.35, 40.21, 16, 24.58, 23.61, 38.94, 36.76,
29.68, 15.97, 20.79, 17.83, 14.65, 16.76, 35.74, 31.5, 25.6,
32.96, 14.1, 40.4, 24.53, 39.57, 21.38, 14.49, 22.11, 27.12,
16.46, 17.65, 37.32, 15.74, 17.07, 28.52, 14.72, 27.75, 36.69,
39.47, 26.13, 35.57, 24.08, 24.39, 13.1, 16.75, 24.49, 23.61,
15.04, 23.22, 37.3, 36.76, 15.77, 28.34, 35.06, 28.32, 29.39,
19.09, 35.68, 35.9, 37.13, 36.1, 40.55, 33.97, 24.03, 37.25,
34.39, 13.05, 21.64, 40.02, 26.17, 19.39, 25.76, 40.92, 24.21,
20.35, 27.7, 29.53, 14.19, 15.64, 32.74, 31.42, 14.01, 12.85,
17.31, 31.67, 23.63, 17.29, 36.71, 18.19, 17.78, 34.87, 36.87,
19.27, 24.97, 41.66, 16.83, 34.79, 14.94, 34.39, 40.66, 31.35,
31.74, 36.19, 18.28, 37.61, 37.19, 29.58, 17.04, 28.84, 16.6,
41.97, 32.36, 27.91, 21.7, 40.45, 35.38, 41.19, 35.68, 19.49,
20.94, 23.99, 14.28, 39.24, 12.19, 18.02, 39.14, 40.61, 33.32,
38.68, 39.18, 31.76, 22.64, 38.18, 36.75, 30.91, 38.82, 30.68,
14.2, 39.34, 18.91, 12.7, 28.2, 37.85, 34.06, 12.88, 40.03,
29.95, 14.61, 17.01, 35.64, 20.49, 39.51, 29.29, 18.84, 36.42,
37.88, 32.65, 19.7, 19.84, 38.75, 21.25, 40.68, 17.89, 26.3,
37.22, 18.03, 17.33, 36.26, 41.98, 19.4, 20.54, 18.6, 26.92,
15.23, 20.22, 15.2, 35.49, 15.14, 14.43, 30.82, 14.79, 17.74,
36.8, 17.09, 18.09, 19.92, 39.64, 23.87, 22.67, 24.66, 24.33,
16.82, 17.91, 21.66, 30.79, 32.91, 25.16, 38.98, 15.49, 21.33,
38.47, 34.46, 24.22, 36.93, 22.25, 15.33, 41.38, 34.49, 23.44,
30.53, 10.62, 23.8, 28.94, 12.49, 22, 24.51, 14.72, 15.53,
23.23, 38.93, 16.06, 19.36, 35.91, 22.2, 15.85, 33.36, 31.75,
19.69, 29.86, 16.3, 19.42, 19.17, 14.41, 13.18, 20.67, 17.02
)), row.names = c(NA, -250L), class = c("tbl_df", "tbl",
"data.frame"))
remotes::install_github("NicChr/timeplyr")
library(timeplyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data %>%
group_by(Group) %>%
time_complete(time = Date, by = "day") %>%
mutate(mean = frollmean2(Value, n = 7, na.rm = TRUE))
#> # A tibble: 258 x 4
#> # Groups: Group [2]
#> Date Group Value mean
#> <date> <fct> <dbl> <dbl>
#> 1 2021-02-19 B 25.5 25.5
#> 2 2021-02-20 B NA 25.5
#> 3 2021-02-21 B NA 25.5
#> 4 2021-02-22 B 30.4 28.0
#> 5 2021-02-22 B 35.3 30.4
#> 6 2021-02-22 B 36.9 32.0
#> 7 2021-02-23 B 25.5 30.7
#> 8 2021-02-23 B 40.2 33.7
#> 9 2021-02-23 B 38.9 34.5
#> 10 2021-02-23 B 36.8 34.9
#> # ... with 248 more rows
创建于 2023-03-29 与 reprex v2.0.2
library(dplyr)
library(zoo)
# Create a calendar dataframe with the full set of dates
calendar <- data.frame(Date = seq(min(data$Date), max(data$Date), by = "day"))
# join data and calendar by "Date" and "Group" columns
data_full <- full_join(data, calendar, by = c("Date"))
# Group the data by date and group and calculate the summary statistics of the value
grouped <- data_full %>%
group_by(Date, Group) %>%
summarise(mean = mean(Value))
# Group the resulting summary statistics data by group
grouped_by_group <- grouped %>%
group_by(Group)
# Use rollapply() to calculate the moving average for each group separately
window_length <- 7 # the desired number of days for the moving average window
grouped_by_group <- grouped_by_group %>%
mutate(rolling = rollapply(mean, window_length, mean, fill = NA, align = "right"))