I'm trying to scrape a number of PDFs with R. I found several examples of how to do this (here is one; here is another), but I can't find a way that works for my case. I want to download files from the main site https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm and, within a given year, e.g. 2018, https://www.federalreserve.gov/monetarypolicy/fomchistorical2018.htm
I need the PDF files for the Beige Book, Tealbook A, and the statements.
I have tried this in several ways. My first attempt was to adapt the code from the first link:
library(tidyverse)
library(rvest)

url <- "https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm"
page <- read_html(url)

urls_pdf <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf")

urls_pdf[1:3] %>% walk2(basename(.), download.file, mode = "wb")
dir(pattern = "\\.pdf")
But I got nothing back.
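I suspect the index page itself contains no direct `.pdf` hrefs, only relative links to the yearly pages (e.g. `/monetarypolicy/fomchistorical2018.htm`), which would explain the empty result. A quick diagnostic sketch (assuming network access) to inspect what the page actually links to:

```r
library(rvest)
library(stringr)

# Pull every href from the index page
page <- read_html("https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm")
hrefs <- page %>%
  html_elements("a") %>%
  html_attr("href")

# How many hrefs end in .pdf versus point at yearly pages?
sum(str_detect(hrefs, "\\.pdf$"), na.rm = TRUE)
sum(str_detect(hrefs, "fomchistorical\\d{4}"), na.rm = TRUE)
```

If the first count is zero and the second is not, the PDFs have to be scraped from the yearly pages, not the index.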
My second attempt was a loop, after working out a pattern in some of the Tealbook A dates:
library(lubridate)  # for days()

# fomc_dates: a Date vector of FOMC meeting dates, defined elsewhere

# Initialize list to store links for Tealbook A reports
tealA <- list()

# Generate links for Tealbook A reports
for (i in seq_along(fomc_dates)) {
  this_fomc <- fomc_dates[i]
  this_teal_A <- this_fomc - days(12)
  link <- paste0(
    "https://www.federalreserve.gov/monetarypolicy/files/FOMC",
    format(this_fomc, "%Y%m%d"), "tealbooka",
    format(this_teal_A, "%Y%m%d"), ".pdf"
  )
  tealA[[i]] <- link
}
The problem is that this pattern does not hold for all links, so it only works for some of them. Any ideas on how to do this in the most automated way possible would be much appreciated!
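Since the constructed URLs are only guesses, one way to salvage this approach is to probe each candidate with an HTTP HEAD request and keep only those that actually exist. A minimal sketch using httr (the helper name `url_exists` is mine):

```r
library(httr)

# Return TRUE if the URL responds with HTTP 200 (i.e. the file exists)
url_exists <- function(url) {
  resp <- tryCatch(HEAD(url, timeout(10)), error = function(e) NULL)
  !is.null(resp) && status_code(resp) == 200
}

# Keep only the guessed Tealbook A links that actually resolve
valid_tealA <- Filter(url_exists, tealA)
```

This still misses any report whose date doesn't fit the guessed offset, so scraping the real hrefs from each yearly page remains the more reliable route.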
Not the most elegant way, but it gets the job done:
library(rvest)
library(httr)

# Generate the yearly FOMC page URLs
generate_links <- function(start_year, end_year) {
  links <- character()
  for (year in start_year:end_year) {
    links <- c(links, paste0("https://www.federalreserve.gov/monetarypolicy/fomchistorical", year, ".htm"))
  }
  return(links)
}

# Function to download PDF files
download_pdfs <- function(links, output_directory) {
  # Create the output directory if it doesn't exist
  if (!dir.exists(output_directory)) {
    dir.create(output_directory, recursive = TRUE)
  }
  # Loop over each link and download the corresponding PDF file
  for (link in links) {
    file_name <- file.path(output_directory, basename(link))
    this_link <- paste0(base_url, link)
    response <- GET(this_link)
    if (status_code(response) == 200) {
      bin_data <- content(response, "raw")
      writeBin(bin_data, file_name)
      cat("Downloaded:", file_name, "\n")
    } else {
      cat("Failed to download:", this_link, "\n")
    }
  }
}

# Example: generate links from 2017 to 2018
start_year <- 2017
end_year <- 2018
url_links <- generate_links(start_year, end_year)

base_url <- "https://www.federalreserve.gov"

for (current_url in url_links) {
  # Read the HTML content of the yearly page
  page <- read_html(current_url)
  # Extract all link targets from the page
  links <- page %>%
    html_elements("a") %>%
    html_attr("href")
  # Keep links that mention "BeigeBook", "tealbooka", or "statement"
  pdf_links <- grep("(BeigeBook|tealbooka|statement)", links, ignore.case = TRUE, value = TRUE)
  # Of those, keep only links that point to PDF files
  pdf_links <- grep("\\.pdf$", pdf_links, value = TRUE)
  # Download them (relative hrefs are resolved against base_url inside the function)
  download_pdfs(pdf_links, "fomc_pdfs")
}
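Whichever variant you use, it's worth being polite to the server when downloading dozens of files: pause between requests and identify your client. A small hedged sketch (the wrapper name and user-agent string are mine) that could replace the plain `httr::GET` call:

```r
library(httr)

# Polite GET: identify ourselves and throttle between requests
polite_get <- function(url, delay = 1) {
  Sys.sleep(delay)  # pause so we don't hammer the server
  GET(url, user_agent("fomc-pdf-scraper (research use)"))
}
```

A one-second delay makes a long run noticeably slower, but it greatly reduces the chance of being rate-limited partway through.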