I am analyzing certain weather events that occurred in certain regions over several years. My data frame looks like this:
library(tidyverse)
df <- tibble(region = c(rep("A", 10), rep("B", 10)),
             event = c(rep("storm", 5), rep("rain", 5), rep("storm", 5), rep("rain", 5)),
             year = c(rep(1:5, 4)),
             occured = c("n", "n", "y", "n", "y",
                         "y", "n", "n", "n", "y",
                         "n", "y", "y", "y", "y",
                         "n", "n", "n", "n", "y")
)
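I now want to create an indicator variable that tells me whether a weather event occurred in a region for the first time in x years.
Suppose I am interested in the previous two years; then the following code does the job: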
df %>%
  group_by(region, event) %>%
  mutate(ind = case_when(
    occured == "y" &
      lag(occured, 1, default = "n") == "n" &
      lag(occured, 2, default = "n") == "n" ~ 1,
    .default = 0
  )
  )
# A tibble: 20 × 5
# Groups:   region, event [4]
   region event  year occured   ind
   <chr>  <chr> <int> <chr>   <dbl>
 1 A      storm     1 n           0
 2 A      storm     2 n           0
 3 A      storm     3 y           1
 4 A      storm     4 n           0
 5 A      storm     5 y           0
 6 A      rain      1 y           1
 7 A      rain      2 n           0
 8 A      rain      3 n           0
 9 A      rain      4 n           0
10 A      rain      5 y           1
11 B      storm     1 n           0
12 B      storm     2 y           1
13 B      storm     3 y           0
14 B      storm     4 y           0
15 B      storm     5 y           0
16 B      rain      1 n           0
17 B      rain      2 n           0
18 B      rain      3 n           0
19 B      rain      4 n           0
20 B      rain      5 y           1
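My question: can anyone show me a more efficient and more flexible way to create this indicator variable?
Ideally, the solution would not require me to add or remove lag statements depending on the number of years I am interested in.
Many thanks!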
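We can use slide_dbl from the slider package to count the "y" values in a rolling window covering the current row and the two preceding rows, and then check that the count is exactly 1 while the current row is itself "y". Changing .before adjusts how many years are looked back, so no lag() calls need to be added or removed.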
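# .by requires dplyr >= 1.1.0; on older versions use group_by(region, event) instead
df %>%
  mutate(ind = 1 * (occured == "y" &
                      # number of "y" values in the current row plus the 2 rows before it
                      slider::slide_dbl(occured == "y", sum, .before = 2) == 1),
         .by = c(region, event))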
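
To make the look-back length a parameter, the same idea could be wrapped in a small helper. This is only a sketch: first_in_n and its argument names are hypothetical and do not come from dplyr or slider, and it assumes the rows within each region/event group are already sorted by year with no missing years.

```r
library(dplyr)

# Example data from the question (column name `occured` kept as spelled there)
df <- tibble(region = c(rep("A", 10), rep("B", 10)),
             event = c(rep("storm", 5), rep("rain", 5), rep("storm", 5), rep("rain", 5)),
             year = rep(1:5, 4),
             occured = c("n", "n", "y", "n", "y",
                         "y", "n", "n", "n", "y",
                         "n", "y", "y", "y", "y",
                         "n", "n", "n", "n", "y"))

# Hypothetical helper: returns 1 when `occurred` is TRUE in the current row
# and FALSE in each of the previous `n` rows, otherwise 0.
first_in_n <- function(occurred, n) {
  # Count TRUE values in the window: current row plus up to `n` rows before it
  hits <- slider::slide_dbl(occurred, sum, .before = n)
  as.integer(occurred & hits == 1)
}

res <- df %>%
  mutate(ind = first_in_n(occured == "y", n = 2),
         .by = c(region, event))
```

With n = 2 this should give the same ind column as the lag-based code in the question; passing a different n changes the look-back without touching the rest of the pipeline.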