This is a follow-up to my question here.
The code provided there does produce the desired output, but it seems to run into problems when a page doesn't exist. I'm trying to use tryCatch to skip those errors and continue.
For example, I specify all the dates with:
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
This works fine because Chicago played home games on all of those days, so the following URLs are all valid:
https://www.basketball-reference.com/boxscores/202402220CHI.html
https://www.basketball-reference.com/boxscores/202402270CHI.html
https://www.basketball-reference.com/boxscores/202402280CHI.html
But if I add another day and/or month like this:
month <- c('02')
year <- c('2024')
day <- c('210','220','270','280')
team <- c('CHI')
Chicago did not play a home game on February 21, so this URL does not exist:
https://www.basketball-reference.com/boxscores/202402210CHI.html
I tried adding this to the code:
page <- tryCatch(read_html(url), error = function(err) "error 404")
but then I got this message:
no applicable method for 'xml_find_first' applied to an object of class "character"
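Checking page after a failed read shows it is just the string from the error handler, not a parsed document, which is what the next html_element() call fails on:
page <- tryCatch(read_html(url), error = function(err) "error 404")
class(page)
# [1] "character"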
How can I skip the pages that don't exist and only return values for the pages that do?
Full code:
library(rvest)
library(dplyr)
library(tidyr)
##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')
team <- c('CHI')
make_url <- function(team, year, month, day) {
  paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}
dates <- expand.grid(team = team, year = year, month = month, day = day)
urls <- dates |>
  mutate(url = make_url(team, year, month, day),
         team = team,
         date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
         .keep = 'unused')
getPageTable <- function(url) {
  #read the page
  page <- read_html(url)
  #get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  #get the game title
  gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
  #get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  #find the quarter scores
  quarters <- grep("Q[1-4]", headings)
  #retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table")
  #select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header = FALSE)
  #add game date and team name/quarter to the results
  tables <- lapply(seq_along(tables), function(i) {
    #promote the second row to column titles
    names(tables[[i]]) <- tables[[i]][2, ]
    tables[[i]] <- tables[[i]][-c(1:2), ]
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]$title <- gameInfo
    tables[[i]]
  })
  #merge the quarterly stats into one dataframe
  df <- bind_rows(tables)
  #drop the repeated header/total marker rows
  df <- df %>% filter(!Starters %in% c("Reserves", "Team Totals"))
  df
}
#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
#merge into one big table
finalResult <- bind_rows(dfs)
finalResult <- finalResult %>% separate("team", into=c("team", "quarter"), " \\(")
finalResult$quarter <- sub("\\)", "", finalResult$quarter)
Here is a solution. Wrap the call to read_html in tryCatch and return the error condition if anything goes wrong, then test for that condition immediately after the read. (That is also why your attempt failed: your handler returned the string "error 404", so page was a plain character vector by the time html_element(), and hence xml_find_first(), was called on it.) This way you get a list containing both data (for the URLs that worked) and error conditions (for the ones that didn't), and you can check which is which outside the function.
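The underlying idiom, stripped of the scraping details (risky_call() here is a placeholder, not a real function):
result <- tryCatch(risky_call(), error = function(e) e)
if (inherits(result, "error")) {
  #the call failed; result is the condition object
  message(conditionMessage(result))
} else {
  #the call succeeded; result holds the value
}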
Here is the corrected function.
getPageTable <- function(url) {
  #read the page, returning the error condition on a 404 (or any other failure)
  page <- tryCatch(
    read_html(url),
    error = function(e) e
  )
  if (inherits(page, "error")) {
    return(page)
  }
  #then continue as in the question's code
  #get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  #get the game title
  gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
  #get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  #find the quarter scores
  quarters <- grep("Q[1-4]", headings)
  #retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table")
  #select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header = FALSE)
  #add game date and team name/quarter to the results
  tables <- lapply(seq_along(tables), function(i) {
    #promote the second row to column titles
    names(tables[[i]]) <- tables[[i]][2, ]
    tables[[i]] <- tables[[i]][-c(1:2), ]
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]$title <- gameInfo
    tables[[i]]
  })
  #merge the quarterly stats into one dataframe
  df <- bind_rows(tables)
  #drop the repeated header/total marker rows
  df <- df %>% filter(!Starters %in% c("Reserves", "Team Totals"))
  df
}
Call the function as before, then check which returns are valid and decide what to do with the errors. Here, for each URL that failed, the URL and the corresponding error message are printed.
#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
#flag the ones that weren't read in
err <- sapply(dfs, inherits, what = "error")
#optional: keep a list of the failed reads
dfs_err <- dfs[err]
#print the URLs and error messages
#(which(err) gives positions in the full dfs list, so index into dfs here)
for (i in which(err)) {
  urls$url[i] %>% message()
  dfs[[i]] %>%
    conditionMessage() %>%
    message()
}
# these are the good ones and the rest of the code is like in the question
dfs <- dfs[!err]
#merge into one big table
finalResult <- bind_rows(dfs)
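As an alternative sketch (not part of the answer above), you could filter out the bad URLs up front with an HTTP HEAD request via the httr package instead of catching the read_html() error, assuming the site returns a 404 status for missing box scores. url_exists() is an illustrative helper name, not an rvest or httr function:
library(httr)

url_exists <- function(url) {
  #HEAD fetches only the response headers, so this is cheap;
  #treat any connection error as "does not exist"
  status <- tryCatch(status_code(HEAD(url)), error = function(e) NA_integer_)
  isTRUE(status == 200)
}

good <- vapply(urls$url, url_exists, logical(1))
dfs <- lapply(urls$url[good], getPageTable)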