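I regularly scrape press releases from the US Department of State, but the site has suddenly started responding with access denied. I have tried from different computers and cloud platforms, and the result is the same. I am also not allowed to access the site's RSS feed: text
Here is my code: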
options(stringsAsFactors = FALSE) # do not convert strings to factors automatically
options("scipen" = 999, "digits" = 4) # suppress scientific notation
library(tidyverse)
library(rvest)
library(readtext)
library(webdriver)
library(gsubfn)
library(lubridate)
library(stringr)
library(readxl)
library(urltools)
library(crul)
############################********Department_of_State*****##################################
HEADERS <- list(
"User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36"
)
US_Gov=tibble(url="https://www.state.gov/press-releases/page/",page=1)
US_Gov=rbind(US_Gov, US_Gov[rep(1, 4), ])
US_Gov=US_Gov %>% mutate(page=row_number())
US_Gov$page=paste0(US_Gov$url,US_Gov$page,"/")
df=US_Gov$page
ttt=lapply(df,function(x){
HttpClient$new(x, headers = HEADERS)$get()
})
t4=lapply(ttt,function(x){
read_html(x$parse())
})
links=lapply(t4,function(x){
html_attr(html_nodes(x, "a"), "href")
})
class=lapply(t4,function(x){
html_attr(html_nodes(x, "a"), "class")
})
US_Links=plyr::ldply(links, cbind)
US_Class=plyr::ldply(class, cbind)
US_Data=bind_cols(US_Links,US_Class)
Rvest 403 access denied
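Once (or if) you get off their blacklist, try slowing down. httr2::req_throttle(), for example, provides a convenient way to add throttling to a dplyr-style pipeline, and it also displays a progress bar in interactive sessions.
Setting only the user-agent in the headers is probably not enough. I didn't bother trying all combinations, but skipping just a few random header values resulted in a 4xx error.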
library(httr2)
library(rvest)
library(purrr)
press_releases <- paste0('page/',1:5,'/') %>%
set_names() %>%
map(\(page){
request(paste0("https://www.state.gov/press-releases/", page)) %>%
req_headers(
authority = "www.state.gov",
accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
`accept-language` = "en-GB,en;q=0.9",
`cache-control` = "max-age=0",
`sec-fetch-dest` = "document",
`sec-fetch-mode` = "navigate",
`sec-fetch-site` = "none",
`sec-fetch-user` = "?1",
`upgrade-insecure-requests` = "1",
`user-agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36"
) %>%
# 5 s delay between requests (12 requests per 60 seconds)
req_throttle(rate = 12 / 60) %>%
req_perform() %>%
resp_body_html()
})
press_releases %>%
map(\(html_doc) html_doc %>%
html_elements("a.collection-result__link") %>%
html_text() %>%
trimws() %>%
tibble::enframe(name = "link_n")) %>%
list_rbind(names_to = "page")
Results for the first 5 pages:
#> # A tibble: 50 × 3
#> page link_n value
#> <chr> <int> <chr>
#> 1 page/1/ 1 Taiwan as an Observer in the World Health Assembly
#> 2 page/1/ 2 Joint Statement on the 2nd U.S.-Netherlands Cyber Dialogue
#> 3 page/1/ 3 United States Sanctions Additional Sinaloa Cartel Network of …
#> 4 page/1/ 4 Assistant Secretary of State for International Organization A…
#> 5 page/1/ 5 Secretary Blinken to Deliver Remarks at the World Food Prize …
#> 6 page/1/ 6 Secretary Blinken’s Call with Israeli Minister of Foreign Aff…
#> 7 page/1/ 7 The Attack on AHA Centre Convoy
#> 8 page/1/ 8 Deputy Secretary Sherman’s Call with European External Action…
#> 9 page/1/ 9 Secretary Blinken’s Meeting with North Macedonia’s Prime Mini…
#> 10 page/1/ 10 Secretary Antony J. Blinken and North Macedonia’s Prime Minis…
#> # ℹ 40 more rows
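The question collected href and class attributes rather than link text, so it may help to pull the URL out next to each title. The following is only a minimal sketch: it runs on a made-up HTML fragment built with minimal_html(), so the markup, the example URLs, and the helper name extract_releases are assumptions for illustration, not the real page structure.

```r
library(rvest)

# Hypothetical fragment imitating one listing page; the real markup may differ.
sample_page <- minimal_html('
  <ul>
    <li><a class="collection-result__link" href="https://www.state.gov/example-release-1/"> First example release </a></li>
    <li><a class="collection-result__link" href="https://www.state.gov/example-release-2/">Second example release</a></li>
    <li><a class="nav-link" href="/about/">About</a></li>
  </ul>
')

# Helper (the name is an assumption): one row per matching anchor,
# holding the trimmed link text and the href attribute.
extract_releases <- function(html_doc) {
  anchors <- html_elements(html_doc, "a.collection-result__link")
  data.frame(
    title = trimws(html_text(anchors)),
    href  = html_attr(anchors, "href"),
    stringsAsFactors = FALSE
  )
}

releases <- extract_releases(sample_page)
print(releases)
```

If the real pages use the same selector, something like map(press_releases, extract_releases) followed by list_rbind(names_to = "page") should give one table across all pages.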
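By the way, some US government sites tend to have geo-restrictions for somewhat surprising regions. For example, I've never been able to access Census Bureau data directly from my EU location (3-4 different locations / service providers).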
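To tell a block apart from other failures, it can be useful to look at the status code instead of letting the request raise an R error. This is a rough sketch, assuming a 403 is how the block shows up (a geo-restriction might instead appear as a timeout or a different code). The helper names probe_request and describe_status are made up for illustration.

```r
library(httr2)

# Helper (hypothetical name): build a request that keeps 4xx/5xx responses
# instead of turning them into R errors, so the status can be inspected.
probe_request <- function(url) {
  request(url) |>
    req_user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)") |>
    req_error(is_error = \(resp) FALSE)
}

# Helper (hypothetical name): map a numeric HTTP status to a short label.
describe_status <- function(status) {
  if (status == 403) {
    "blocked (403 Forbidden)"
  } else if (status >= 400) {
    paste0("other error (", status, ")")
  } else {
    "ok"
  }
}

# Needs network access, so it is left commented out here:
# resp <- req_perform(probe_request("https://www.state.gov/press-releases/"))
# describe_status(resp_status(resp))
```

Checking a single page this way before starting a full crawl avoids sending many requests to a site that is already refusing them.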