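I have a simple problem for which I can only find unattractive solutions.
I have time series data that I analyze at the daily level. On some days, an event occurs. I want to create one variable that flags every day in the week following an event, and another that counts how many days have passed since the event within that week. Below is an example of what I want to achieve.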
         date event week_following_event days_since_event
1  2000-01-01     0                    0               NA
2  2000-01-02     0                    0               NA
3  2000-01-03     1                    0               NA
4  2000-01-04     0                    1                1
5  2000-01-05     0                    1                2
6  2000-01-06     0                    1                3
7  2000-01-07     0                    1                4
8  2000-01-08     0                    1                5
9  2000-01-09     0                    1                6
10 2000-01-10     0                    1                7
11 2000-01-11     0                    0               NA
12 2000-01-12     0                    0               NA
13 2000-01-13     0                    0               NA
14 2000-01-14     0                    0               NA
15 2000-01-15     0                    0               NA
I'm fairly sure I could do this by writing a loop, but ideally I'm looking for a neater solution.
Here is the dput() output for reproducibility:
structure(list(date = structure(c(10957, 10958, 10959, 10960,
10961, 10962, 10963, 10964, 10965, 10966, 10967, 10968, 10969,
10970, 10971), class = "Date"), event = c(0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), week_following_event = c(0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0), days_since_event = c(NA, NA,
NA, 1L, 2L, 3L, 4L, 5L, 6L, 7L, NA, NA, NA, NA, NA)), row.names = c(NA,
-15L), class = "data.frame")
Any input is much appreciated!
A base R approach that should also work when you have multiple events:
# Initialize the output columns
df$week_following_event <- 0
df$days_since_event <- NA
# Row indices where an event occurs
event_days <- which(df$event == 1)
# Row indices of the 7 days following each event
week_following_index <- unlist(lapply(event_days, function(x) x + 1:7))
# Matching day counts (1 to 7 for each event)
days <- rep(1:7, times = length(event_days))
# Drop indices that run past the last row (event in the final week of the data)
keep <- week_following_index <= nrow(df)
# Fill in the values
df$week_following_event[week_following_index[keep]] <- 1
df$days_since_event[week_following_index[keep]] <- days[keep]
df
# date event week_following_event days_since_event
#1 2000-01-01 0 0 NA
#2 2000-01-02 0 0 NA
#3 2000-01-03 1 0 NA
#4 2000-01-04 0 1 1
#5 2000-01-05 0 1 2
#6 2000-01-06 0 1 3
#7 2000-01-07 0 1 4
#8 2000-01-08 0 1 5
#9 2000-01-09 0 1 6
#10 2000-01-10 0 1 7
#11 2000-01-11 0 0 NA
#12 2000-01-12 0 0 NA
#13 2000-01-13 0 0 NA
#14 2000-01-14 0 0 NA
#15 2000-01-15 0 0 NA
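As a loop-free variation on the same idea, one could track the row of the most recent event with cummax() and subtract it from the current row number. This is only a sketch and not the method used above: the helper name `days_since_last_event` and its `window` argument are made up for illustration, and it assumes one row per consecutive day, sorted by date. One difference worth noting is that the counter restarts whenever a second event falls inside an earlier event's window.

```r
# Hypothetical helper: days since the most recent event, limited to `window` days.
# Assumes rows are consecutive days sorted by date.
days_since_last_event <- function(event, window = 7) {
  idx <- seq_along(event)
  # Row number of the latest event seen so far (0 before the first event)
  last_event <- cummax(ifelse(event == 1, idx, 0))
  days <- idx - last_event
  # Blank out rows before any event, the event days themselves,
  # and rows that fall outside the window
  days[last_event == 0 | days == 0 | days > window] <- NA
  days
}

event <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
days_since_event <- days_since_last_event(event)
week_following_event <- as.integer(!is.na(days_since_event))
```

With the question's data this should fill rows 4 to 10 with 1 through 7 and leave NA everywhere else; the flag column is then derived from whether a count is present.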
Another approach using dplyr:
df = structure(list(date = structure(c(10957, 10958, 10959, 10960, 10961, 10962, 10963, 10964, 10965, 10966, 10967, 10968, 10969, 10970, 10971), class = "Date"),
event = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
week_following_event = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
days_since_event = c(NA, NA, NA, 1L, 2L, 3L, 4L, 5L, 6L, 7L, NA, NA, NA, NA, NA)),
row.names = c(NA, -15L), class = "data.frame")
library(dplyr)
# drop the expected-output columns so they can be recomputed
df = df %>% select(date, event)
df %>%
  group_by(group = cumsum(event)) %>%  # start a new group at each event
  mutate(days_since_event = ifelse(group > 0, row_number() - 1, NA),  # count days only once an event has occurred
         days_since_event = ifelse(between(days_since_event, 1, 7), days_since_event, NA),  # keep only up to a week after the event
         week_following_event = ifelse(is.na(days_since_event), 0, 1)) %>%  # flag days up to a week after an event
  ungroup() %>%
  select(-group)
which returns:
# # A tibble: 15 x 4
# date event days_since_event week_following_event
# <date> <dbl> <dbl> <dbl>
# 1 2000-01-01 0 NA 0
# 2 2000-01-02 0 NA 0
# 3 2000-01-03 1 NA 0
# 4 2000-01-04 0 1 1
# 5 2000-01-05 0 2 1
# 6 2000-01-06 0 3 1
# 7 2000-01-07 0 4 1
# 8 2000-01-08 0 5 1
# 9 2000-01-09 0 6 1
#10 2000-01-10 0 7 1
#11 2000-01-11 0 NA 0
#12 2000-01-12 0 NA 0
#13 2000-01-13 0 NA 0
#14 2000-01-14 0 NA 0
#15 2000-01-15 0 NA 0
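The grouping step is the core of the dplyr answer, so it may help to see it in isolation. The sketch below uses base R only and a made-up toy vector with two events; `ave()` with `seq_along` stands in for `row_number()` inside `group_by()`, which is an assumption of equivalence for this simple case rather than part of the original answer.

```r
# Toy event vector with two events (made up for illustration)
event <- c(0, 0, 1, 0, 0, 1, 0)

# cumsum() increases by 1 at each event, so every event starts a new group;
# rows before the first event fall into group 0
group <- cumsum(event)

# Position within each group, counted from 0 on the event day itself
# (base R counterpart of row_number() - 1 inside group_by())
days_since <- ave(seq_along(event), group, FUN = seq_along) - 1

# Rows in group 0 precede any event, so they carry no count
days_since[group == 0] <- NA

data.frame(event, group, days_since)
```

Because a new group starts at every event, the count resets to 0 on each event day, which is why the answer then keeps only values between 1 and 7.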
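Here is an option with data.table. Convert the 'data.frame' to a 'data.table' (setDT), get the row indices of the next 7 rows after each row where 'event' is 1 ('i1'), and use that index to set 'week_following_event' to 1 (the other rows are NA by default). Then group by the rleid of the non-NA elements in 'week_following_event' and create 'days_since_event' as the row sequence within each group.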
library(data.table)
i1 <- setDT(df)[, sort(unique(pmin(rep(.I[event == 1], each = 7) + 1:7, .N)))]
df[i1, week_following_event := 1
][, days_since_event := seq_len(.N) * week_following_event,
rleid(!is.na(week_following_event))
]#[is.na(week_following_event), week_following_event := 0][] # if needed
# date event week_following_event days_since_event
# 1: 2000-01-01 0 NA NA
# 2: 2000-01-02 0 NA NA
# 3: 2000-01-03 1 NA NA
# 4: 2000-01-04 0 1 1
# 5: 2000-01-05 0 1 2
# 6: 2000-01-06 0 1 3
# 7: 2000-01-07 0 1 4
# 8: 2000-01-08 0 1 5
# 9: 2000-01-09 0 1 6
#10: 2000-01-10 0 1 7
#11: 2000-01-11 0 NA NA
#12: 2000-01-12 0 NA NA
#13: 2000-01-13 0 NA NA
#14: 2000-01-14 0 NA NA
#15: 2000-01-15 0 NA NA
Data:
df <- structure(list(date = structure(c(10957, 10958, 10959, 10960,
10961, 10962, 10963, 10964, 10965, 10966, 10967, 10968, 10969,
10970, 10971), class = "Date"), event = c(0, 0, 1, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, -15L), class = "data.frame")
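To make the index construction and the run grouping in the data.table answer easier to follow, here is a rough base R walk-through of the same two steps. The toy setup (15 rows, events on rows 3 and 12) is made up for illustration, and the `rle()` construction is only a stand-in for what `rleid()` computes, not data.table's own implementation.

```r
# Toy setup (made up for illustration): 15 rows, events on rows 3 and 12
n <- 15
event_rows <- c(3, 12)

# Each event row repeated 7 times, plus offsets 1..7 (1:7 is recycled per event)
idx <- rep(event_rows, each = 7) + 1:7

# Clamp to the last row, then drop the duplicates the clamp creates
i1 <- sort(unique(pmin(idx, n)))

# Base R stand-in for rleid(): number consecutive runs of equal values
flag <- seq_len(n) %in% i1
r <- rle(flag)
run_id <- rep(seq_along(r$lengths), r$lengths)
```

One caveat of clamping with pmin(): if an event sits on the very last row, every offset is mapped back onto that row, so the event day itself ends up flagged. Filtering with `idx[idx <= n]` instead of clamping would avoid that, if it matters for your data.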