Avoiding 404 errors when scraping multiple pages with rvest


This is a follow-up to my earlier question here.

The code provided there does give the desired output, but it breaks when a page does not exist. I am trying to use tryCatch to skip those errors and continue.

For example, I specify all the dates with the following:

month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

This works, because Chicago played home games on all of those days, so the following URLs are all valid:

https://www.basketball-reference.com/boxscores/202402220CHI.html

https://www.basketball-reference.com/boxscores/202402270CHI.html

https://www.basketball-reference.com/boxscores/202402280CHI.html

But if I add another day and/or month like this:

month <- c('02')
year <- c('2024')
day <- c('210','220','270','280')
team <- c('CHI')

Chicago did not play a home game on February 21, so this URL does not exist:

https://www.basketball-reference.com/boxscores/202402210CHI.html

I tried adding this to the code:

page <- tryCatch(read_html(url), error = function(err) "error 404")

But then I got this message:

no applicable method for 'xml_find_first' applied to an object of class "character"
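
I assume this is because, when read_html() fails, the tryCatch above returns the plain string "error 404"; the next html_element() call then receives a character vector instead of a parsed document. For example:

page <- tryCatch(read_html(url), error = function(err) "error 404")
class(page)
#> [1] "character"  # not an xml_document, so xml_find_first() has no method to apply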

How can I skip the pages that do not exist and return results only for the pages that do?

Full code:

library(rvest)
library(dplyr)
library(tidyr)

##sample only - ultimately this will include all teams and all months and days
month <- c('02')
year <- c('2024')
day <- c('220','270','280')   
team <- c('CHI')

make_url <- function(team, year, month, day) {
   paste0('https://www.basketball-reference.com/boxscores/', year, month, day, team, '.html')
}
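
# For illustration: with the sample inputs above,
# make_url('CHI', '2024', '02', '220') returns
# "https://www.basketball-reference.com/boxscores/202402220CHI.html"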

dates <- expand.grid(team = team, year = year, month = month, day = day)

urls <- dates |>
   mutate( url = make_url(team, year, month, day),
      team = team,
      date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
      .keep = 'unused'
   )
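
# Illustrative result: `urls` should now look roughly like this,
# one row per team/date combination:
#   team  url                                                                date
#   CHI   https://www.basketball-reference.com/boxscores/202402220CHI.html  2024-02-22
#   ...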

getPageTable <- function(url) {
   #read the page
   page <- read_html(url)

   #get the game's date
   gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
   
   #get game title
   gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
   #get the table headings
   headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
   
   #find the quarter scores
   quarters <- grep("Q[1-4]", headings)
   
   #retrieve the tables from the page
   tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 

   #select the desired headings and tables
   headings <- headings[quarters]
   tables <- tables[quarters] %>% html_table(header=FALSE)

   #add game date and team name/quarter to the results
   tables <- lapply(1:length(tables), function(i) {
      #set column titles to second row
      names(tables[[i]]) <- tables[[i]][2,]
      tables[[i]] <- tables[[i]][-c(1:2),]
      tables[[i]]$gamedate <- gamedate
      tables[[i]]$team <- headings[i]
      tables[[i]]$title <- gameInfo
      tables[[i]]
   })
   #merge the quarterly stats into 1 dataframe
   df <- bind_rows(tables)
   #drop the repeated header and total rows
   df <- df %>% filter(!Starters %in% c("Reserves", "Team Totals"))
   df
}


#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
#merge into one big table
finalResult <- bind_rows(dfs)
finalResult <- finalResult %>% separate("team", into=c("team", "quarter"), " \\(")
finalResult$quarter <- sub("\\)", "", finalResult$quarter)
1 Answer

Here is a solution. Wrap the call to read_html in tryCatch and return the error condition if anything goes wrong, then test for that condition immediately after the read. This way you get a list containing both data (where the URL was valid) and error objects (where it was not), and you can check which is which outside the function.
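
In isolation, the pattern is just this (a minimal sketch, using the non-existent URL from the question):

read_or_error <- function(url) {
  # returns the parsed page on success, the error condition object on failure
  tryCatch(read_html(url), error = function(e) e)
}
page <- read_or_error("https://www.basketball-reference.com/boxscores/202402210CHI.html")
inherits(page, "error")
#> [1] TRUE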

Here is the corrected function.

getPageTable <- function(url) {
  # read the page, returning the error condition if error 404 (or other)
  page <- tryCatch(
    read_html(url),
    error = function(e) e
  )
  if(inherits(page, "error")) {
    return(page)
  }
  # then continue as in the question's code 
  #get the game's date
  gamedate <- page %>% html_element("div.scorebox_meta div") %>% html_text2()
  
  #get game title
  gameInfo <- page %>% html_elements("div.box h1") %>% html_text()
  #get the table headings
  headings <- page %>% html_elements("div.section_wrapper") %>% html_element("h2") %>% html_text()
  
  #find the quarter scores
  quarters <- grep("Q[1-4]", headings)
  
  #retrieve the tables from the page
  tables <- page %>% html_elements("div.section_wrapper") %>% html_element("table") 
  
  #select the desired headings and tables
  headings <- headings[quarters]
  tables <- tables[quarters] %>% html_table(header=FALSE)
  
  #add game date and team name/quarter to the results
  tables <- lapply(1:length(tables), function(i) {
    #set column titles to second row
    names(tables[[i]]) <- tables[[i]][2,]
    tables[[i]] <- tables[[i]][-c(1:2),]  
    tables[[i]]$gamedate <- gamedate
    tables[[i]]$team <- headings[i]
    tables[[i]]$title <- gameInfo
    tables[[i]]
  })
  #merge the quarterly stats into 1 dataframe
  df <- bind_rows(tables)
  #drop the repeated header and total rows
  df <- df %>% filter(!Starters %in% c("Reserves", "Team Totals"))
  df
}

Call the function above, check for valid return values, and decide how to handle the errors. In this case, for each bad URL the corresponding error is printed as a message.

#loop through the URLs
dfs <- lapply(urls$url, getPageTable)
# get which weren't read in
err <- sapply(dfs, inherits, what = "error")
# optional, make a list of the bad ones
dfs_err <- dfs[err]
# and print the URLs and error messages
for(i in which(err)) {
  urls$url[i] %>% message()
  dfs[[i]] %>%   # index the full list, since which(err) refers to it
    conditionMessage() %>%
    message()
}

# these are the good ones and the rest of the code is like in the question
dfs <- dfs[!err]
#merge into one big table
finalResult <- bind_rows(dfs)
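
As an aside, purrr::safely() packages up the same wrap-and-inspect pattern, if you prefer that idiom. A sketch, assuming purrr is installed; note it works with the original, uncorrected getPageTable, since safely() does the catching itself:

library(purrr)

safe_get <- safely(getPageTable)   # each call returns list(result = ..., error = ...)
results <- lapply(urls$url, safe_get)
ok <- vapply(results, function(x) is.null(x$error), logical(1))
finalResult <- bind_rows(lapply(results[ok], `[[`, "result"))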