Handling 404s and other bad URLs when reading with read_html in R

Question

Summary: handling errors and bad pages with tryCatch and R's read_html function.

We are using R's read_html function to connect to a number of NCAA sports websites, and we need to identify when a page is in error. Here are some example URLs of bad pages:

 - www.newburynighthawks.com (does not exist)
 - http://www.clarkepride.com/sports/womens-basketball/roster/2020-21 (404 not found)
 - https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19 (not found)
 - www.lambuth.edu/athletics/index.html (does not exist)
 - https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19 (page not found)

When used with read_html, each of these URLs fails in its own way. To handle them, I wrote a function that uses tryCatch to check the validity of these pages:

library(httr)   # GET, timeout
library(rvest)  # read_html, html_nodes, html_text

check_url_validity <- function(this_url) {
  good_url = FALSE

  # go to url to check for a rosters page
  bad_page_titles = c('Page Not Found', 'Page not found', '404')
  result = tryCatch({
    team_page <- this_url %>% GET(., timeout(2)) %>% read_html
    team_page_title <- team_page %>% html_nodes('title') %>% html_text
    team_page_body <- team_page %>% html_nodes('body') %>% html_text
    good_page <- !grepl('Page not found', team_page_title) &&
      !grepl('Page Not Found', team_page_title) &&
      !grepl('404', team_page_title) &&
      team_page_title != "" &&
      !grepl('Error 404', team_page_body)
    
    if(good_page) { good_url = TRUE }
  }, error = function(e) { NA })
  
  return(good_url)
}

Testing this function on the URLs linked above gives the following results:

these_urls = c(
'www.newburynighthawks.com', 
'http://www.clarkepride.com/sports/womens-basketball/roster/2020-21',
'https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19',
'www.lambuth.edu/athletics/index.html',
'https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19'
)

for (this_url in these_urls) {
  print(check_url_validity(this_url))
}

Some of these pages (http://www.newburynighthawks.com/) are easily identified as bad inside the tryCatch, because there is no page at all. Others (http://www.clarkepride.com/sports/womens-basketball/roster/2020-21) rely on string matching against the page body to discover that something is wrong. The overall problem is that this is a hacky solution: we are dealing with ~1000 different URLs here, and we keep adding conditions to the line of code that determines whether good_page is TRUE or FALSE. We are up to 5 conditions at the moment, most of them grepl calls that string-match phrases such as "404" and "Not Found" in the title and body.

Is there a better solution for recognizing that these pages are bad than string matching "404" / "Not Found" in the body?

r error-handling try-catch rvest
1 Answer

Rather than trying to read the page content, the code below uses the httr package to issue a HEAD request. This is faster and returns all the information needed.

library(httr)

check_url_validity <- function(this_url){
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    "does not exist"
    #conditionMessage(r)
  } else {
    httr::http_status(r)$reason
  }
}

lapply(urls_vec, check_url_validity)
#[[1]]
#[1] "does not exist"
#
#[[2]]
#[1] "Not Found"
#
#[[3]]
#[1] "Not Found"
#
#[[4]]
#[1] "does not exist"
#
#[[5]]
#[1] "OK"
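One caveat worth hedging: a few servers answer HEAD with 405 Method Not Allowed (or 501) even though the page itself is fine. The sketch below, which is my own variant rather than part of the answer above, retries with GET in that case before judging the URL:

```r
library(httr)

# Sketch: judge a URL by status code, but fall back to GET when the
# server rejects the HEAD method itself (405/501), since that says
# nothing about whether the page exists.
check_url_status <- function(this_url) {
  r <- tryCatch(httr::HEAD(this_url, httr::timeout(5)),
                error = function(e) e)
  if (!inherits(r, "error") && httr::status_code(r) %in% c(405, 501)) {
    r <- tryCatch(httr::GET(this_url, httr::timeout(5)),
                  error = function(e) e)
  }
  # NA = request failed entirely (e.g. host does not exist),
  # otherwise TRUE for a 2xx status, FALSE for 4xx/5xx.
  if (inherits(r, "error")) NA else httr::status_code(r) < 300
}
```

For well-behaved servers this behaves identically to the status-code check above; the GET retry only fires on the two method-related codes.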

To return NA/FALSE/TRUE instead, the function below follows the same lines.

check_url_validity2 <- function(this_url){
  r <- tryCatch(httr::HEAD(this_url),
                error = function(e) e
  )
  if(inherits(r, "error")){
    NA
  }else{
    httr::status_code(r) < 300
  }
}

lapply(urls_vec, check_url_validity2)
#[[1]]
#[1] NA
#
#[[2]]
#[1] FALSE
#
#[[3]]
#[1] FALSE
#
#[[4]]
#[1] NA
#
#[[5]]
#[1] TRUE
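Since the question mentions ~1000 URLs, it may help to classify the whole vector at once. The sketch below (my own usage suggestion, building on check_url_validity2 above; the name classify_urls is mine) uses vapply to get a named logical vector you can filter directly:

```r
library(httr)

# check_url_validity2 as defined above: NA if the request errors,
# otherwise TRUE for status < 300 and FALSE for 4xx/5xx.
check_url_validity2 <- function(this_url){
  r <- tryCatch(httr::HEAD(this_url), error = function(e) e)
  if (inherits(r, "error")) NA else httr::status_code(r) < 300
}

# Classify a whole vector of URLs. vapply enforces one logical per
# URL and, for an unnamed character vector, names the result by the
# URLs themselves.
classify_urls <- function(urls) {
  vapply(urls, check_url_validity2, logical(1))
}

# Usage: keep only the URLs that came back TRUE (NAs are dropped):
# status   <- classify_urls(urls_vec)
# good_urls <- names(which(status))
```

Because which() ignores both FALSE and NA, unreachable hosts and 404s are filtered out in one step.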

Data

urls_vec <- c(
  "www.newburynighthawks.com", 
  "http://www.clarkepride.com/sports/womens-basketball/roster/2020-21", 
  "https://lyon.edu/sports/lyon_sports.html/sports/mens-basketball/roster/2018-19", 
  "www.lambuth.edu/athletics/index.html", 
  "https://uvi.edu/pub-relations/athletics/athletics.htm/sports/womens-basketball/roster/2018-19"
)