How do I automatically change the page number when web scraping?

Question

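My problem is that every time I scrape a different section of a given static site (where I don't know the exact number of pages), I have to manually change the "pages_to_scrape" integer vector. I would like to automate this. In other words, I want to be able to scrape all available pages when I don't know the page count in advance.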

library(rvest)
library(tidyverse)  # map_dfr, tibble, mutate, separate_wider_delim, parse_number, str_remove

base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"

# Scrape product names and prices from a single page of the category
scrape_prices <- function(page) {
  url <- sprintf(base_url, page)
  page_content <- read_html(url)
  pc <- page_content %>%
    html_elements(".col-sm-6") %>%
    map_dfr(~ tibble(
      product = .x %>%
        html_element(".shop-three-products-name a") %>%
        html_text2(),
      price = .x %>%
        html_element(".shop-three-products-price") %>%
        html_text2()
    )) %>%
    mutate(date = Sys.Date(),
           location = "Unknown",
           type = "Unknown",
           source = "Unknown", .before = product) %>%
    # Split the price text on " - " into a unit column and a price column
    separate_wider_delim(price, delim = " - ", names = c("unit", "price")) %>%
    mutate(price = parse_number(price), unit = str_remove(unit, "\\.")) %>%
    distinct()
  return(pc)
}

pages_to_scrape <- 1:5
final_df <- map_dfr(pages_to_scrape, scrape_prices)

Answer 1

If the site is paginated, you can use a custom function like this to determine the number of pages:

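get_total_pages <- function() {
  url <- sprintf(base_url, 1)
  page_content <- read_html(url)
  # Read the page number stored on the last pagination link
  total_pages <- page_content %>%
    html_element(".pagination li:last-child a") %>%
    html_attr("data-ci-pagination-page") %>%
    as.integer()
  # If no pagination link is found, fall back to a single page
  # so that 1:total_pages does not fail on NA
  if (is.na(total_pages)) total_pages <- 1L
  return(total_pages)
}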

# Check the total number of pages
total_pages <- get_total_pages()

# Scrape prices for all pages
pages_to_scrape <- 1:total_pages
final_df <- map_dfr(pages_to_scrape, scrape_prices)

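I tried opening the URL to inspect the HTML, but it only loads a logo and nothing else, so let me know if the structure differs and I can adjust the custom function to help; otherwise this could be your starting point.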
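If the pagination markup cannot be inspected or relied on, one alternative is to keep requesting pages until one comes back without any products. The following is only a sketch under stated assumptions: scrape_until_empty, its max_pages and delay arguments, and fake_scraper are hypothetical names introduced here for illustration, and it assumes that an out-of-range page either yields zero rows or makes the scraping function raise an error. It uses base R only, so it can be tried offline with the stand-in scraper before pointing it at scrape_prices.

```r
# Hypothetical helper (not part of rvest): request pages 1, 2, 3, ... and stop
# at the first page that yields no rows or raises an error.
# `scrape_page` is any function that takes a page number and returns a data frame.
scrape_until_empty <- function(scrape_page, max_pages = 100L, delay = 1) {
  results <- list()
  for (page in seq_len(max_pages)) {
    # Assumption: a failed request or a failed parse means we ran past the last page
    page_df <- tryCatch(scrape_page(page), error = function(e) NULL)
    if (is.null(page_df) || nrow(page_df) == 0) break
    results[[length(results) + 1L]] <- page_df
    Sys.sleep(delay)  # pause between requests to avoid hammering the server
  }
  if (length(results) == 0) return(data.frame())
  do.call(rbind, results)
}

# Offline stand-in for a real scraper: pretends the site has exactly 3 pages
fake_scraper <- function(page) {
  if (page > 3) stop("no products on this page")
  data.frame(page = rep(page, 2), product = c("a", "b"))
}

demo_df <- scrape_until_empty(fake_scraper, delay = 0)

# With the code from the question, the call would be:
# final_df <- scrape_until_empty(scrape_prices)
```

The max_pages cap is there because some sites return the last page again for out-of-range page numbers, in which case the loop would otherwise never see an empty page. Note that the tryCatch also hides genuine errors (for example a network failure on page 2), so it may be worth logging the error message before stopping.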
