I have a long-format data frame containing games, game dates, and whether each team won. It has the following structure:
   GAME_DATE_EST  GAME_ID    TEAM_ID WIN
   <date>           <int>      <int> <lgl>
 1 2015-06-16    41400406 1610612739 FALSE
 2 2015-06-16    41400406 1610612744 TRUE
 3 2015-06-14    41400405 1610612744 TRUE
 4 2015-06-14    41400405 1610612739 FALSE
 5 2015-06-11    41400404 1610612739 FALSE
 6 2015-06-11    41400404 1610612744 TRUE
 7 2015-06-09    41400403 1610612739 TRUE
 8 2015-06-09    41400403 1610612744 FALSE
 9 2015-06-07    41400402 1610612744 FALSE
10 2015-06-07    41400402 1610612739 TRUE
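For each row, I need the most recent previous game that satisfies various conditions. For example, I want a new column containing the GAME_ID of the respective team's most recent win, another column containing the GAME_ID of its most recent loss, and so on.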
The only solution I have found so far is to use this function together with rowwise(), but it takes a very long time to run:
get_most_recent_win = function(df, team_id, date) {
  temp = subset(df, WIN & TEAM_ID == team_id & GAME_DATE_EST < date,
                select = c("GAME_ID", "GAME_DATE_EST"))
  if (nrow(temp) > 0) {
    return(temp[which.max(temp$GAME_DATE_EST), "GAME_ID"])
  } else {
    return(NA)
  }
}

games_longer[1:50, ] %>%
  rowwise() %>%
  mutate(most_recent_win = get_most_recent_win(., TEAM_ID, GAME_DATE_EST)) %>%
  select(GAME_DATE_EST, GAME_ID, most_recent_win, TEAM_ID, WIN)
In general, what is the cleanest and most efficient way to solve this kind of problem?
Here is some data to try it on:
structure(list(GAME_DATE_EST = structure(c(16602, 16602, 16600,
16600, 16597, 16597, 16595, 16595, 16593, 16593, 16590, 16590,
16582, 16582, 16581, 16581, 16580, 16580, 16579, 16579), class = "Date"),
GAME_ID = c(41400406L, 41400406L, 41400405L, 41400405L, 41400404L,
41400404L, 41400403L, 41400403L, 41400402L, 41400402L, 41400401L,
41400401L, 41400315L, 41400315L, 41400304L, 41400304L, 41400314L,
41400314L, 41400303L, 41400303L), TEAM_ID = c(1610612739L,
1610612744L, 1610612744L, 1610612739L, 1610612739L, 1610612744L,
1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612744L,
1610612739L, 1610612744L, 1610612745L, 1610612739L, 1610612737L,
1610612745L, 1610612744L, 1610612739L, 1610612737L), WIN = c(FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE,
FALSE)), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
First, make sure the data is sorted by date with arrange:
library(dplyr)
library(tidyr)

# `df` is the data frame from the question
df |>
  arrange(GAME_DATE_EST) |>          # oldest game first
  group_by(TEAM_ID) |>
  # keep GAME_ID only for wins, then shift by one so the current game is excluded
  mutate(last_win = lag(if_else(WIN, GAME_ID, NA_integer_))) |>
  fill(last_win) |>                  # carry the latest earlier win forward
  ungroup()
# A tibble: 20 × 5
   GAME_DATE_EST  GAME_ID    TEAM_ID WIN   last_win
   <date>           <int>      <int> <lgl>    <int>
 1 2015-05-24    41400303 1610612739 TRUE        NA
 2 2015-05-24    41400303 1610612737 FALSE       NA
 3 2015-05-25    41400314 1610612745 TRUE        NA
 4 2015-05-25    41400314 1610612744 FALSE       NA
 5 2015-05-26    41400304 1610612739 TRUE  41400303
 6 2015-05-26    41400304 1610612737 FALSE       NA
 7 2015-05-27    41400315 1610612744 TRUE        NA
 8 2015-05-27    41400315 1610612745 FALSE 41400314
 9 2015-06-04    41400401 1610612744 TRUE  41400315
10 2015-06-04    41400401 1610612739 FALSE 41400304
11 2015-06-07    41400402 1610612744 FALSE 41400401
12 2015-06-07    41400402 1610612739 TRUE  41400304
13 2015-06-09    41400403 1610612739 TRUE  41400402
14 2015-06-09    41400403 1610612744 FALSE 41400401
15 2015-06-11    41400404 1610612739 FALSE 41400403
16 2015-06-11    41400404 1610612744 TRUE  41400401
17 2015-06-14    41400405 1610612744 TRUE  41400404
18 2015-06-14    41400405 1610612739 FALSE 41400403
19 2015-06-16    41400406 1610612739 FALSE 41400403
20 2015-06-16    41400406 1610612744 TRUE  41400405
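
The question also asks for the most recent loss "and so on". As a minimal sketch, assuming the same column names as in the question, the per-team shift-and-fill idea can be repeated once per condition. The object names games and result are placeholders, and only the first ten rows of the sample data are used to keep the example short.

```r
library(dplyr)
library(tidyr)

# First ten rows of the question's sample data (two teams, five games)
games <- data.frame(
  GAME_DATE_EST = as.Date(c("2015-06-16", "2015-06-16", "2015-06-14", "2015-06-14",
                            "2015-06-11", "2015-06-11", "2015-06-09", "2015-06-09",
                            "2015-06-07", "2015-06-07")),
  GAME_ID = c(41400406L, 41400406L, 41400405L, 41400405L, 41400404L,
              41400404L, 41400403L, 41400403L, 41400402L, 41400402L),
  TEAM_ID = c(1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612739L,
              1610612744L, 1610612739L, 1610612744L, 1610612744L, 1610612739L),
  WIN = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

result <- games |>
  arrange(GAME_DATE_EST) |>            # chronological order, oldest first
  group_by(TEAM_ID) |>
  mutate(
    # keep the ID only where the condition holds, then shift by one game
    # so that a row never refers to itself
    last_win  = lag(if_else(WIN,  GAME_ID, NA_integer_)),
    last_loss = lag(if_else(!WIN, GAME_ID, NA_integer_))
  ) |>
  # carry the latest matching ID forward within each team
  fill(last_win, last_loss, .direction = "down") |>
  ungroup()

print(result)
```

Any other condition that can be written as a logical vector should fit the same pattern by adding one more line inside mutate() and one more column name in fill(). Because everything is computed per group with vectorised functions, this avoids re-subsetting the whole data frame for every row, which is likely what makes the rowwise() version slow.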