R: Find the most recent data that satisfies a condition


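I have a long-format data frame containing games, the date of each game, and whether the respective team won. It has the following structure: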

   GAME_DATE_EST GAME_ID  TEAM_ID    WIN  
   <date>        <int>    <int>      <lgl>
 1 2015-06-16    41400406 1610612739 FALSE
 2 2015-06-16    41400406 1610612744 TRUE 
 3 2015-06-14    41400405 1610612744 TRUE 
 4 2015-06-14    41400405 1610612739 FALSE
 5 2015-06-11    41400404 1610612739 FALSE
 6 2015-06-11    41400404 1610612744 TRUE 
 7 2015-06-09    41400403 1610612739 TRUE 
 8 2015-06-09    41400403 1610612744 FALSE
 9 2015-06-07    41400402 1610612744 FALSE
10 2015-06-07    41400402 1610612739 TRUE 

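For each row, I need the most recent previous game that satisfies various conditions. For example, I want a new column containing the GAME_ID of the respective team's most recent win, another column containing the GAME_ID of its most recent loss, and so on.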

The only solution I have found so far is to use the function below together with rowwise(), but it takes a very long time to run:

get_most_recent_win = function(df, team_id, date) {
  temp = subset(df, WIN & TEAM_ID == team_id & GAME_DATE_EST < date,
                select = c("GAME_ID", "GAME_DATE_EST"))
  
  if (nrow(temp) > 0) {
    return(temp[which.max(temp$GAME_DATE_EST), "GAME_ID"])
  } else {
    return(NA)
  }
}

games_longer[1:50, ] %>%
  rowwise() %>% 
  mutate(most_recent_win = get_most_recent_win(., TEAM_ID, GAME_DATE_EST)) %>% 
  select(GAME_DATE_EST, GAME_ID, most_recent_win, TEAM_ID, WIN)

In general, what is the cleanest and most efficient way to solve this kind of problem?

Here is some data to try it on:

structure(list(GAME_DATE_EST = structure(c(16602, 16602, 16600, 
16600, 16597, 16597, 16595, 16595, 16593, 16593, 16590, 16590, 
16582, 16582, 16581, 16581, 16580, 16580, 16579, 16579), class = "Date"), 
    GAME_ID = c(41400406L, 41400406L, 41400405L, 41400405L, 41400404L, 
    41400404L, 41400403L, 41400403L, 41400402L, 41400402L, 41400401L, 
    41400401L, 41400315L, 41400315L, 41400304L, 41400304L, 41400314L, 
    41400314L, 41400303L, 41400303L), TEAM_ID = c(1610612739L, 
    1610612744L, 1610612744L, 1610612739L, 1610612739L, 1610612744L, 
    1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612744L, 
    1610612739L, 1610612744L, 1610612745L, 1610612739L, 1610612737L, 
    1610612745L, 1610612744L, 1610612739L, 1610612737L), WIN = c(FALSE, 
    TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, 
    TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, 
    FALSE)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", 
"data.frame"))
r dplyr tidyr lag
1 Answer

First make sure the data is sorted by date with arrange():

library(dplyr)
library(tidyr)

df |> 
    # newest game first, matching the order of the sample data
    arrange(desc(GAME_DATE_EST)) |> 
    group_by(TEAM_ID) |> 
    # GAME_ID of the team's immediately preceding game if it was a win, else NA
    mutate(last_win = lead(if_else(WIN, GAME_ID, NA_integer_))) |> 
    # rows are newest-first, so values from older games are pulled upward
    fill(last_win, .direction = "up") |> 
    ungroup()

# A tibble: 20 × 5
   GAME_DATE_EST  GAME_ID    TEAM_ID WIN   last_win
   <date>           <int>      <int> <lgl>    <int>
 1 2015-06-16    41400406 1610612739 FALSE 41400403
 2 2015-06-16    41400406 1610612744 TRUE  41400405
 3 2015-06-14    41400405 1610612744 TRUE  41400404
 4 2015-06-14    41400405 1610612739 FALSE 41400403
 5 2015-06-11    41400404 1610612739 FALSE 41400403
 6 2015-06-11    41400404 1610612744 TRUE  41400401
 7 2015-06-09    41400403 1610612739 TRUE  41400402
 8 2015-06-09    41400403 1610612744 FALSE 41400401
 9 2015-06-07    41400402 1610612744 FALSE 41400401
10 2015-06-07    41400402 1610612739 TRUE  41400304
11 2015-06-04    41400401 1610612744 TRUE  41400315
12 2015-06-04    41400401 1610612739 FALSE 41400304
13 2015-05-27    41400315 1610612744 TRUE        NA
14 2015-05-27    41400315 1610612745 FALSE 41400314
15 2015-05-26    41400304 1610612739 TRUE  41400303
16 2015-05-26    41400304 1610612737 FALSE       NA
17 2015-05-25    41400314 1610612745 TRUE        NA
18 2015-05-25    41400314 1610612744 FALSE       NA
19 2015-05-24    41400303 1610612739 TRUE        NA
20 2015-05-24    41400303 1610612737 FALSE       NA
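
The question also asks for the most recent loss, so the same grouped lag-and-fill idea may be worth sketching for several conditions at once. The block below is a minimal sketch, not the answerer's code. It assumes each team plays at most one game per date. The name `games` is a hypothetical stand-in holding the first ten rows of the sample data, and the column name `most_recent_loss` is an assumption modelled on the question's `most_recent_win`.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in: the first ten rows of the question's sample data.
games <- tibble(
  GAME_DATE_EST = as.Date(rep(c("2015-06-16", "2015-06-14", "2015-06-11",
                                "2015-06-09", "2015-06-07"), each = 2)),
  GAME_ID = rep(c(41400406L, 41400405L, 41400404L, 41400403L, 41400402L),
                each = 2),
  TEAM_ID = c(1610612739L, 1610612744L, 1610612744L, 1610612739L, 1610612739L,
              1610612744L, 1610612739L, 1610612744L, 1610612744L, 1610612739L),
  WIN = c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

result <- games |>
  # oldest game first within each team, so lag() looks at the previous game
  arrange(TEAM_ID, GAME_DATE_EST) |>
  group_by(TEAM_ID) |>
  mutate(
    # ID of the previous game if it met the condition, otherwise NA
    most_recent_win  = lag(if_else(WIN,  GAME_ID, NA_integer_)),
    most_recent_loss = lag(if_else(!WIN, GAME_ID, NA_integer_))
  ) |>
  # carry the last matching ID forward to later games of the same team
  fill(most_recent_win, most_recent_loss, .direction = "down") |>
  ungroup() |>
  # restore the newest-first order used in the question
  arrange(desc(GAME_DATE_EST))

print(result)
```

With this subset, team 1610612739's game on 2015-06-16 gets `most_recent_win` 41400403 and `most_recent_loss` 41400405. Further conditions can be added as extra `lag(if_else(...))` columns and listed in the same `fill()` call. Because everything is vectorised per group, this avoids the per-row `subset()` that makes the rowwise() version slow.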