Scraping Basketball Reference box scores and outputting specific list elements to a data frame in R


I want to scrape the box score for every game from Basketball Reference, for example:

https://www.basketball-reference.com/boxscores/202402220CHI.html

The tables cover the different periods of the game (Q1, Q2, Q3, Q4), and you can view each one by clicking the corresponding tab.

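My goal is to get each table (for every team on every day of the season), join them together, and mark which period each table applies to (Q1, Q2, etc.), most likely by adding a column with values such as "Q1" and "Q2".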
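As a minimal sketch of that goal, assuming the per-period tables are already in a named list, `dplyr::bind_rows()` can stack them and turn the list names into a column. The list `quarter_tables` and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-ins for two scraped quarter tables (values are made up)
quarter_tables <- list(
  Q1 = data.frame(player = c("A", "B"), pts = c(4, 7)),
  Q2 = data.frame(player = c("A", "B"), pts = c(6, 2))
)

# .id stores each list name in a new column called "period"
combined <- bind_rows(quarter_tables, .id = "period")
combined
```

The same call works for any number of periods, as long as the list is named.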

My attempt so far:


    library(rvest)
    library(tidyverse)
    
    ##sample only - ultimately this will include all teams and all months and days
    
    month <- c('02')
    year <- c('2024')
    day <- c('220','270','280')   
    team <- c('CHI')

    make_url <- function(team, year, month, day) {
      paste0(
        'https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html'
      )
    }
    
    dates <- expand.grid(
      team = team, year = year, month = month, day = day
    )
    
    urls <- dates |>
      mutate(
        url = make_url(team, year, month, day),
        team = team,
        date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
        .keep = 'unused'
      )
    
    
    scrape_table <- function(url) {
      page_html <- url %>%
        rvest::read_html()  
      
      page_html %>%
        rvest::html_nodes("table") %>%
        rvest::html_table(header = FALSE)
      
    }
    
    
    safe_scrape_table <- purrr::safely(scrape_table)
    
    tbl_scrape <- purrr::map(urls$url, \(url) {
      Sys.sleep(5)
      safe_scrape_table(url)
    }) |>
      set_names(paste(urls$team, urls$date, sep = '-'))
    
    
    final_result <- tbl_scrape |>
      purrr::transpose() |>
      pluck('result')

This is where I am stuck. I can see that list[[1]] and list[[9]] are the Game output, list[[2]] and list[[10]] are the Q1 output, and so on.

How can I get only the ones I need and bind them together? I only need Q1, Q2, Q3, and Q4.
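One possible way to pick the quarter tables without relying on list positions is to filter on each table's `id` attribute. The sketch below runs on a toy page built with `rvest::minimal_html()`; the id pattern `box-<TEAM>-q1-basic` is an assumption about the site's markup and should be checked against a real page.

```r
library(rvest)

# Toy page; the ids imitate an assumed pattern "box-<TEAM>-<period>-basic"
page <- minimal_html('
  <table id="box-CHI-game-basic"><tr><td>x</td></tr></table>
  <table id="box-CHI-q1-basic"><tr><td>x</td></tr></table>
  <table id="box-CHI-h1-basic"><tr><td>x</td></tr></table>
  <table id="box-BOS-q1-basic"><tr><td>x</td></tr></table>
')

tables <- html_elements(page, "table")
ids <- html_attr(tables, "id")

# Keep only tables whose id contains -q1- through -q4-
is_quarter <- grepl("-q[1-4]-", ids)
quarter_tables <- html_table(tables[is_quarter], header = FALSE)
names(quarter_tables) <- ids[is_quarter]

names(quarter_tables)
```

Naming the list by id keeps the team and period attached to each table for later steps.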

I also need to add a column that holds the name of each list element, e.g. "CHI-2024-02-22", so I know which game the statistics relate to.
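A rough sketch of that labelling step, assuming each named list element holds one table per game and that failed scrapes come back as `NULL` (as `purrr::safely()` produces). The toy `final_result` below only imitates that shape.

```r
library(dplyr)
library(purrr)

# Toy stand-in: one table per game, NULL where the scrape failed
final_result <- list(
  "CHI-2024-02-22" = data.frame(player = c("A", "B"), pts = c(4, 7)),
  "CHI-2024-02-27" = NULL
)

labelled <- final_result |>
  compact() |>                                              # drop NULL results
  imap(\(tbl, game) mutate(tbl, game = game, .before = 1)) |>  # list name -> column
  bind_rows()

labelled
```

`imap()` passes each element together with its name, so the game label never has to be tracked separately.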

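Finally, I would like to add two columns, one for the home team and one for the away team. I know these details appear on each page, but I don't know how to get them.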
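One hedged idea for the home and away columns rests on two assumptions that should be verified on real pages: that the three-letter code in the box score URL is the home team, and that the table ids follow a pattern like `box-<TEAM>-q1-basic`. The helper `game_teams()` is hypothetical and the ids passed to it are toy values.

```r
# Hypothetical helper: infer home and away team codes from a URL and table ids
game_teams <- function(url, table_ids) {
  # Assumption: the three capital letters before ".html" are the home team
  home <- sub(".*([A-Z]{3})\\.html$", "\\1", url)
  # Assumption: ids look like "box-<TEAM>-...", one team code per table
  teams <- unique(sub("^box-([A-Z]{3})-.*$", "\\1", table_ids))
  away <- setdiff(teams, home)
  list(home = home, away = away)
}

res <- game_teams(
  "https://www.basketball-reference.com/boxscores/202402220CHI.html",
  c("box-BOS-q1-basic", "box-CHI-q1-basic")  # toy ids for illustration
)
res
```

The two values could then be added to each game's data frame in the same way as the game label.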

r rvest
1 Answer

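Try this. In the function getPageTable, I read the game date, the table headings, and all of the tables on the page. I keep only the box scores for quarters 1, 2, 3, and 4, drop the header rows, add the game date and the table heading to each table, and then combine them into one data frame per game.
See the comments in the code for more details.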
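The header clean-up described above can be illustrated on a toy table. This is only a sketch: `promote_header()` is a hypothetical helper, and the assumption that the real column names sit in the second row follows from reading the tables with `header = FALSE`.

```r
# Toy table shaped like html_table(header = FALSE) output (values are made up)
raw <- data.frame(
  X1 = c("", "Starters", "Player A", "Player B"),
  X2 = c("Basic Box Score Stats", "MP", "12:00", "9:30"),
  X3 = c("Basic Box Score Stats", "PTS", "8", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: use one row as column names and drop the rows above it
promote_header <- function(tbl, header_row = 2) {
  names(tbl) <- unlist(tbl[header_row, ], use.names = FALSE)
  tbl <- tbl[-seq_len(header_row), , drop = FALSE]
  rownames(tbl) <- NULL
  tbl
}

clean <- promote_header(raw)
clean
```

All columns remain character after this step, so numeric statistics still need converting, for example with `as.numeric()`.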

    library(rvest)
    library(dplyr)

    ## sample only - ultimately this will include all teams and all months and days
    month <- c('02')
    year <- c('2024')
    day <- c('220','270','280')
    team <- c('CHI')

    make_url <- function(team, year, month, day) {
       paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
    }

    dates <- expand.grid(team = team, year = year, month = month, day = day)

    urls <- dates |>
       mutate(
          url = make_url(team, year, month, day),
          team = team,
          date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
          .keep = 'unused'
       )

    getPageTable <- function(url) {
       # read the page
       page <- read_html(url)

       # get the game's date
       gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()

       # get the table headings
       headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()

       # find the headings that refer to a quarter (Q1 to Q4)
       quarters <- grep("Q[1-4]", headings)

       # retrieve the tables from the page
       tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table")

       # select the desired headings and tables
       headings <- headings[quarters]
       tables <- tables[quarters] %>% html_table(header = FALSE)

       # add the game date and team name/quarter to the results
       tables <- lapply(seq_along(tables), function(i) {
          tbl <- tables[[i]]
          # use the second row as the column names, then drop both header rows
          names(tbl) <- unlist(tbl[2, ], use.names = FALSE)
          tbl <- tbl[-c(1:2), ]
          tbl$gamedate <- gamedate
          tbl$team <- headings[i]
          tbl
       })

       # merge the quarterly stats into one data frame
       bind_rows(tables)
    }


    # loop through the URLs
    dfs <- lapply(urls$url, getPageTable)
    # merge into one big table
    finalResult <- bind_rows(dfs)