I want to scrape the box score for every game from Basketball Reference, for example this one:
https://www.basketball-reference.com/boxscores/202402220CHI.html
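The tables show the different periods of the game, e.g. Q1, Q2, Q3 and Q4, which you can view by clicking the corresponding tabs.
My goal is to get each of these tables (for every team, for every day of the season), bind them together, and then record which period each table applies to, e.g. Q1, Q2, etc., most likely by adding a column containing "Q1", "Q2" and so on.
My attempt so far: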
library(rvest)
library(tidyverse)
##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
make_url <- function(team, year, month, day) {
  paste0(
    'https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html'
  )
}
dates <- expand.grid(
  team = team, year = year, month = month, day = day
)
urls <- dates |>
  mutate(
    url = make_url(team, year, month, day),
    team = team,
    date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
    .keep = 'unused'
  )
scrape_table <- function(url) {
  page_html <- url %>%
    rvest::read_html()
  page_html %>%
    rvest::html_nodes("table") %>%
    rvest::html_table(header = FALSE)
}
safe_scrape_table <- purrr::safely(scrape_table)
tbl_scrape <- purrr::map(urls$url, \(url) {
  Sys.sleep(5)
  safe_scrape_table(url)
}) |>
  set_names(paste(urls$team, urls$date, sep = '-'))
final_result <- tbl_scrape |>
  purrr::transpose() |>
  pluck('result')
This is where I'm stuck. I can see that list[[1]] and list[[9]] are the Game output, list[[2]] and list[[10]] are the Q1 output, and so on.
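How can I keep only the tables I need and bind them together? I only need Q1, Q2, Q3 and Q4.
I also need to add a column that holds the name of each list element, e.g. "CHI-2024-02-22", so I know which game the stats relate to.
Finally, I would like to add two columns, one for the home team and one for the away team. I know these details appear on every page, but I don't know how to retrieve them.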
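Try this. In the function getPageTable, I read the game date, the table headings, and all of the tables on the page. I keep only the box scores for quarters 1, 2, 3 and 4, remove the header rows from each table, add the game date and the table heading as columns, and then merge everything into one data frame per game.
See the comments in the code for more details.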
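As a small illustration of the filtering step, the sketch below runs the heading filter on made-up headings and also pulls the quarter label into its own vector, which is one possible way to get the "Q1", "Q2" column mentioned in the question. The heading strings and the names kept, quarter_label and team_label are assumptions for illustration only; the real headings on the page may be worded differently.

```r
# Hypothetical headings; the real strings on the page may differ.
headings <- c(
  "Team A Basic and Advanced Stats",
  "Team A Q1", "Team A Q2", "Team A H1",
  "Team A Q3", "Team A Q4", "Team A H2",
  "Team B Q1"
)

# "Q[1-4]" matches a Q followed by one digit from 1 to 4.
# Inside [...], "|" is a literal character, not "or".
quarters <- grep("Q[1-4]", headings)

# Keep only the quarter headings.
kept <- headings[quarters]

# Extract the quarter label and the remaining text before it.
quarter_label <- regmatches(kept, regexpr("Q[1-4]", kept))
team_label <- trimws(sub("Q[1-4].*$", "", kept))

data.frame(heading = kept, team = team_label, quarter = quarter_label)
```

Inside getPageTable, the same extraction could be applied to headings[i] to fill a separate quarter column next to the team column.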
library(rvest)
library(dplyr)
##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
make_url <- function(team, year, month, day) {
  paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}
dates <- expand.grid(team = team, year = year, month = month, day = day)
urls <- dates |>
  mutate(
    url = make_url(team, year, month, day),
    team = team,
    date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
    .keep = 'unused'
  )
getPageTable <- function(url) {
  # read the page
  page <- read_html(url)
  # get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  # get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  # find the quarter box scores (headings containing Q1, Q2, Q3 or Q4)
  quarters <- grep("Q[1-4]", headings)
  # retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table")
  # select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header = FALSE)
  # add game date and team name/quarter to the results
  tables <- lapply(seq_along(tables), function(i) {
    # use the second row as column names, then drop the two header rows
    names(tables[[i]]) <- as.character(unlist(tables[[i]][2, ]))
    tables[[i]] <- tables[[i]][-c(1:2), ]
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]
  })
  # merge the quarterly stats into 1 data frame
  df <- bind_rows(tables)
  df
}
# loop through the URLs
dfs <- lapply(urls$url, getPageTable)
# merge into one big table
finalResult <- bind_rows(dfs)
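The code above does not cover the game label or the home and away teams asked about in the question. A rough sketch of one way to get them follows. It runs on a mock score box rather than the live page, so the HTML snippet, the selector "div.scorebox strong a", the placeholder team names and the assumption that the visiting team is listed first are all unverified guesses that should be checked against a real page. The game label is rebuilt from the URL, whose three-letter code is the team used to build it in make_url.

```r
library(rvest)

# Mock of the score box. ASSUMPTION: the real page has two team
# links inside div.scorebox, with the visiting team listed first.
page <- minimal_html('
  <div class="scorebox">
    <div><strong><a href="/teams/AAA/2024.html">Away Team</a></strong></div>
    <div><strong><a href="/teams/CHI/2024.html">Home Team</a></strong></div>
  </div>')

url <- "https://www.basketball-reference.com/boxscores/202402220CHI.html"

# Team names in page order (placeholders in this mock).
teams <- page |> html_elements("div.scorebox strong a") |> html_text2()

# Rebuild a label such as "CHI-2024-02-22" from the file name:
# characters 1-8 are the date, 9 is a game index, 10-12 the team code.
id <- sub("\\.html$", "", basename(url))
game <- sprintf(
  "%s-%s-%s-%s",
  substr(id, 10, 12), substr(id, 1, 4), substr(id, 5, 6), substr(id, 7, 8)
)

game_info <- data.frame(
  game = game,
  away_team = teams[1],
  home_team = teams[2]
)
game_info
```

If the selector holds up on the real page, the same three values could be added inside getPageTable next to gamedate, for example with tables[[i]]$home_team <- teams[2].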