How can I automatically change the page number when web scraping?


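My problem is that every time I scrape a different section of a given static site (where I don't know the exact number of pages), I have to change the "pages_to_scrape" integer vector by hand. I would like to automate this. In other words, I want to be able to scrape all available pages when I don't know the page count in advance.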

library(rvest)
library(tidyverse)  # map_dfr, tibble, mutate, separate_wider_delim, parse_number, str_remove

# %d is replaced by the page number via sprintf()
base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"

# Scrape product names and prices from a single page
scrape_prices <- function(page) {
  url <- sprintf(base_url, page)
  page_content <- read_html(url)
  pc <- page_content %>%
    html_elements(".col-sm-6") %>%
    # One row per product card
    map_dfr(~ tibble(
      product = .x %>%
        html_element(".shop-three-products-name a") %>%
        html_text2(),
      price = .x %>%
        html_element(".shop-three-products-price") %>%
        html_text2()
    )) %>%
    mutate(date = Sys.Date(),
           location = "Unknown",
           type = "Unknown",
           source = "Unknown", .before = product) %>%
    # The price text has the form "<unit> - <price>"
    separate_wider_delim(price, delim = " - ", names = c("unit", "price")) %>%
    mutate(price = parse_number(price), unit = str_remove(unit, "\\.")) %>%
    distinct()
  return(pc)
}

# This vector currently has to be edited by hand for each section
pages_to_scrape <- 1:5
final_df <- map_dfr(pages_to_scrape, scrape_prices)

Tags: r, web-scraping, rvest
1 Answer

If the site is paginated, you can use a custom function like this one to work out the number of pages:

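get_total_pages <- function() {
  # Load the first page and read the last pagination link
  url <- sprintf(base_url, 1)
  page_content <- read_html(url)
  total_pages <- page_content %>% 
    html_element(".pagination li:last-child a") %>% 
    # Assumes this attribute holds the page number the link points to
    html_attr("data-ci-pagination-page")
  # Returns NA if no matching pagination link is found
  return(as.integer(total_pages))
}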

# Check the total number of pages
total_pages <- get_total_pages()

# Scrape prices for all pages
pages_to_scrape <- 1:total_pages
final_df <- map_dfr(pages_to_scrape, scrape_prices)

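I tried opening the URL to inspect the HTML, but it only loads a logo and nothing else. Let me know if the structure is different and I can adjust the custom function; otherwise, this may serve as a starting point.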
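
Since the pagination selector above could not be checked against the live page, another option is to avoid reading the page count at all and simply request pages until one comes back empty. The following is a minimal sketch of that idea, not a tested solution for this site. The helper `scrape_until_empty()`, its `max_pages` cap, and the stand-in `fake_scrape()` are hypothetical names introduced here for illustration. It assumes the scraping function returns a data frame, and it treats an error, an empty result, or a page identical to the previous one as the end of the data.

```r
# Hypothetical helper: request pages 1, 2, 3, ... until no new data comes back.
# `scrape_fn` is any function that takes a page number and returns a data frame.
scrape_until_empty <- function(scrape_fn, max_pages = 100) {
  results <- list()
  previous <- NULL
  for (page in seq_len(max_pages)) {
    # Treat an error (e.g. missing columns on an empty page) as "no more data"
    current <- tryCatch(scrape_fn(page), error = function(e) NULL)
    if (is.null(current) || nrow(current) == 0) break
    # Some sites repeat the last page for out-of-range page numbers
    if (!is.null(previous) && identical(current, previous)) break
    results[[length(results) + 1]] <- current
    previous <- current
  }
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Stand-in for scrape_prices(): three pages of made-up data, then nothing
fake_scrape <- function(page) {
  if (page > 3) return(data.frame())
  data.frame(product = paste0("item", page), price = page * 1.5)
}

demo_df <- scrape_until_empty(fake_scrape)
```

With the function from the question, the call would be `final_df <- scrape_until_empty(scrape_prices)`. Note that `tryCatch()` here also swallows genuine failures such as network errors, so a page that fails to load ends the loop silently; logging the error message inside the handler would make that easier to spot.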
