Scraping many PDFs with R


I am trying to scrape many PDFs with R. I have found several examples of how to do this (here is one and here is another), but I can't get it to work myself. I want to download the files from the main site https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm and, within a specific year such as 2018, from https://www.federalreserve.gov/monetarypolicy/fomchistorical2018.htm

I need the PDFs of the Beige Book, the Tealbook A, and the statements.

I have tried this in several ways. My first attempt adapted the code from the first example:

library(tidyverse)
library(rvest)

url <- "https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm"

page <- read_html(url)

# Collect every href on the page and keep the ones that end in .pdf
urls_pdf <- page %>% 
  html_elements("a") %>% 
  html_attr("href") %>% 
  str_subset("\\.pdf") 

# Download the first three matches and check what landed on disk
urls_pdf[1:3] %>% walk2(basename(.), download.file, mode = "wb")

dir(pattern = "\\.pdf")

But I get nothing.
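My guess is that the index page only links to the per-year pages (e.g. /monetarypolicy/fomchistorical2018.htm) rather than to any PDFs, and that the hrefs are relative anyway, so the .pdf filter has nothing to match. A quick check of the raw hrefs (a sketch, reusing the page object from above):

# Quick check: what hrefs does the index page actually contain?
page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  head(20)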

My second attempt was a loop, after working out a pattern for the Tealbook A dates:

library(lubridate)

# fomc_dates is assumed to be a vector of Date objects holding the FOMC meeting dates

# Initialize list to store links for Tealbook A reports
tealA <- list()

# Generate links for Tealbook A reports: the Tealbook A date is taken
# to be 12 days before the corresponding FOMC meeting
for (i in seq_along(fomc_dates)) {
  this_fomc <- fomc_dates[i]
  this_teal_A <- this_fomc - days(12)
  link <- paste0("https://www.federalreserve.gov/monetarypolicy/files/FOMC",
                 format(this_fomc, "%Y%m%d"),
                 "tealbooka",
                 format(this_teal_A, "%Y%m%d"),
                 ".pdf")
  tealA[[i]] <- link
}
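One way to see which of these constructed URLs actually exist is to check their HTTP status codes (a sketch using httr, assuming tealA has been filled by the loop above):

library(httr)

# Check which constructed Tealbook A links actually resolve (HTTP 200)
link_ok <- vapply(unlist(tealA), function(u) status_code(HEAD(u)) == 200, logical(1))
table(link_ok)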

The problem is that this naming pattern does not hold for every meeting, so it only works for some of the links. Any ideas on how to do this in the most automated way possible would be greatly appreciated!

r web-scraping httr economics
1 Answer

Not the most elegant way, but it gets the job done:

library(rvest)
library(httr)

# Build the per-year FOMC history page URLs
generate_links <- function(start_year, end_year) {
  links <- character()
  for (year in start_year:end_year) {
    links <- c(links, paste0("https://www.federalreserve.gov/monetarypolicy/fomchistorical", year, ".htm"))
  }
  return(links)
}

# Example: generate links for 2017 to 2018
start_year <- 2017
end_year <- 2018

url_links <- generate_links(start_year, end_year)
base_url <- "https://www.federalreserve.gov"
no_urls <- length(url_links)

# Function to download PDF files
download_pdfs <- function(pdf_links, output_directory) {
  # Create the output directory if it doesn't exist
  if (!dir.exists(output_directory)) {
    dir.create(output_directory, recursive = TRUE)
  }
  
  # Loop over each link and download the corresponding PDF file
  for (link in pdf_links) {
    file_name <- paste0(output_directory, "/", basename(link))
    this_link <- paste0(base_url, link)   # the scraped hrefs are relative
    response <- httr::GET(this_link)
    if (httr::status_code(response) == 200) {
      bin_data <- httr::content(response, "raw")
      writeBin(bin_data, file_name)
      cat("Downloaded:", file_name, "\n")
    } else {
      cat("Failed to download:", this_link, "\n")
    }
  }
}

for (this_url in 1:no_urls) {
  current_url <- url_links[this_url]
  
  # Read the HTML content of the webpage
  page <- read_html(current_url)
  
  # Extract all links from the webpage
  links <- page %>%
    html_elements("a") %>%
    html_attr("href")
  
  # Keep links that mention the Beige Book, Tealbook A, or a statement
  pdf_links <- grep("(BeigeBook|tealbooka|statement)", links, ignore.case = TRUE, value = TRUE)
  
  # ...and of those, keep only the ones that point to PDF files
  pdf_links <- grep("\\.pdf$", pdf_links, value = TRUE)
  
  # Download this year's PDFs into a local folder ("fomc_pdfs" is just a choice of path)
  download_pdfs(pdf_links, "fomc_pdfs")
}
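Once the loop has finished, it is easy to verify what was downloaded (using the "fomc_pdfs" folder passed to download_pdfs above); adding a short pause between years also keeps the scraping polite:

# Optional: put Sys.sleep(1) inside the year loop to pause between requests

# List what actually landed in the output folder
dir("fomc_pdfs", pattern = "\\.pdf$")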

