rvest scraping of the U.S. Department of State website returns 403 Forbidden


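I regularly scrape press releases from the U.S. Department of State, but the site suddenly started responding with 403 Forbidden. I tried from different computers and cloud platforms, and the result is the same. I am also denied access to the site's RSS feed.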

Here is my code:

options(stringsAsFactors = F)         # no automatic data transformation
options("scipen" = 999, "digits" = 4) # suppress math annotation
library(tidyverse)
library(rvest)
library(readtext)
library(webdriver)
library(gsubfn)
library(lubridate)
library(stringr)
library(readxl)
library(urltools)
library(crul)
############################********Department_of_State*****##################################

HEADERS <- list(
  "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
  
)


US_Gov=tibble(url="https://www.state.gov/press-releases/page/",page=1)
US_Gov=rbind(US_Gov, US_Gov[rep(1, 4), ])
US_Gov=US_Gov %>% mutate(page=row_number())
US_Gov$page=paste0(US_Gov$url,US_Gov$page,"/")

df=US_Gov$page


ttt=lapply(df,function(x){
  
  HttpClient$new(x, headers = HEADERS)$get()  # crul's argument is named `headers`
})



t4=lapply(ttt,function(x){
  read_html(x$parse())
})

links=lapply(t4,function(x){
  html_attr(html_nodes(x, "a"), "href")
  
})

class=lapply(t4,function(x){
  html_attr(html_nodes(x, "a"), "class")
  
})




US_Links=plyr::ldply(links, cbind)
US_Class=plyr::ldply(class, cbind)


US_Data=bind_cols(US_Links,US_Class)


r web-scraping rvest
1 answer

Once (or if) you get off their blacklist, try slowing down.

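httr2::req_throttle(), for example, is a convenient way to add throttling to a pipeline, and it shows a progress bar in interactive sessions. Setting only user-agent in the headers might not be enough. I didn't bother to try every combination, but leaving out just a few random header values resulted in 4xx errors.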

library(httr2)
library(rvest)
library(purrr)

press_releases <- paste0('page/',1:5,'/') %>% 
  set_names() %>% 
  map(\(page){
    request(paste0("https://www.state.gov/press-releases/", page)) %>% 
    req_headers(
      authority = "www.state.gov",
      accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
      `accept-language` = "en-GB,en;q=0.9",
      `cache-control` = "max-age=0",
      `sec-fetch-dest` = "document",
      `sec-fetch-mode` = "navigate",
      `sec-fetch-site` = "none",
      `sec-fetch-user` = "?1",
      `upgrade-insecure-requests` = "1",
      `user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36",
    ) %>% 
    # 5 s delay between requests (12 requests per 60 s)
    req_throttle(rate = 12 / 60) %>% 
    req_perform() %>% 
    resp_body_html()
  })

press_releases %>% 
  map(\(html_doc) html_doc %>% 
        html_elements("a.collection-result__link") %>% 
        html_text() %>% 
        trimws() %>% 
        tibble::enframe(name = "link_n")) %>% 
  list_rbind(names_to = "page") 

Results for the first 5 pages:

#> # A tibble: 50 × 3
#>    page    link_n value                                                         
#>    <chr>    <int> <chr>                                                         
#>  1 page/1/      1 Taiwan as an Observer in the World Health Assembly            
#>  2 page/1/      2 Joint Statement on the 2nd U.S.-Netherlands Cyber Dialogue    
#>  3 page/1/      3 United States Sanctions Additional Sinaloa Cartel Network of …
#>  4 page/1/      4 Assistant Secretary of State for International Organization A…
#>  5 page/1/      5 Secretary Blinken to Deliver Remarks at the World Food Prize …
#>  6 page/1/      6 Secretary Blinken’s Call with Israeli Minister of Foreign Aff…
#>  7 page/1/      7 The Attack on AHA Centre Convoy                               
#>  8 page/1/      8 Deputy Secretary Sherman’s Call with European External Action…
#>  9 page/1/      9 Secretary Blinken’s Meeting with North Macedonia’s Prime Mini…
#> 10 page/1/     10 Secretary Antony J. Blinken and North Macedonia’s Prime Minis…
#> # ℹ 40 more rows
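
The question collected link targets rather than titles, so the same selector can also return the href of each release. The following is a minimal sketch, not part of the original answer: `extract_releases()` is a hypothetical helper, and the inline HTML with its `example-release-*` URLs is made-up stand-in data so the block runs without touching the site. Only the `a.collection-result__link` selector comes from the answer above.

```r
library(rvest)

# Made-up stand-in for one parsed results page (assumption: real pages use
# the same `a.collection-result__link` class that the answer selects on).
sample_page <- read_html('
  <ul>
    <li><a class="collection-result__link" href="https://www.state.gov/example-release-1/"> Example release 1 </a></li>
    <li><a class="collection-result__link" href="https://www.state.gov/example-release-2/"> Example release 2 </a></li>
    <li><a class="other" href="https://www.state.gov/unrelated/">Unrelated link</a></li>
  </ul>')

# Hypothetical helper: one row per press release, with title and link target.
extract_releases <- function(html_doc) {
  nodes <- html_elements(html_doc, "a.collection-result__link")
  data.frame(
    title = trimws(html_text(nodes)),
    href  = html_attr(nodes, "href"),
    stringsAsFactors = FALSE
  )
}

releases <- extract_releases(sample_page)
print(releases)
```

Assuming the `press_releases` list from the answer is available, something like `map(press_releases, extract_releases) %>% list_rbind(names_to = "page")` should give the same per-page layout with an extra `href` column.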

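By the way, some U.S. government sites tend to have geo-restrictions for somewhat surprising regions. For example, I have never been able to access Census Bureau data directly from my EU location (3-4 different locations/service providers).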
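
Whatever the cause of a block (request rate, missing headers, or geography), it arrives as a 4xx status, and by default `req_perform()` turns that into an R error. The sketch below shows one way to inspect the status instead. `build_request()` and `is_blocked()` are hypothetical helpers, treating 429 as a block alongside 403 is an assumption, and the responses are synthetic objects created with `httr2::response()` so that nothing is actually sent to the server.

```r
library(httr2)

# Build (but do not send) a request that will not raise an R error on
# 4xx/5xx responses, so the status code can be checked by the caller.
build_request <- function(url) {
  request(url) %>%
    req_error(is_error = \(resp) FALSE)
}

# Hypothetical helper: treat 403 (forbidden) and 429 (too many requests)
# as "blocked or rate limited". This choice of codes is an assumption.
is_blocked <- function(resp) {
  resp_status(resp) %in% c(403L, 429L)
}

# Offline illustration with synthetic responses instead of real traffic.
fake_forbidden <- response(status_code = 403)
fake_ok        <- response(status_code = 200)

blocked_flags <- c(is_blocked(fake_forbidden), is_blocked(fake_ok))
print(blocked_flags)

req <- build_request("https://www.state.gov/press-releases/page/1/")
```

In real use, `resp <- req_perform(req)` followed by a check of `is_blocked(resp)` before calling `resp_body_html(resp)` would let a loop pause or stop as soon as the site starts refusing requests.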
