My problem is that every time I scrape a different section of this static site (I don't know the exact page count in advance), I have to manually change the `pages_to_scrape` integer vector. I'd like to automate this: in other words, scrape all available pages even when I don't know the page numbers beforehand.
library(rvest)
library(tidyverse)  # needed for map_dfr, tibble, mutate, separate_wider_delim, parse_number, str_remove

base_url <- "https://morskidar.bg/products.php?category=pryasna-riba&page=%d"

scrape_prices <- function(page) {
  url <- sprintf(base_url, page)
  page_content <- read_html(url)
  pc <- page_content %>%
    html_elements(".col-sm-6") %>%
    map_dfr(~ tibble(
      product = .x %>%
        html_element(".shop-three-products-name a") %>%
        html_text2(),
      price = .x %>%
        html_element(".shop-three-products-price") %>%
        html_text2()
    )) %>%
    mutate(date = Sys.Date(),
           location = "Unknown",
           type = "Unknown",
           source = "Unknown", .before = product) %>%
    separate_wider_delim(price, delim = " - ", names = c("unit", "price")) %>%
    mutate(price = parse_number(price), unit = str_remove(unit, "\\.")) %>%
    distinct()
  return(pc)
}

pages_to_scrape <- 1:5
final_df <- map_dfr(pages_to_scrape, scrape_prices)
If the site is paginated, you can determine the number of pages with a custom function like this:
get_total_pages <- function() {
  url <- sprintf(base_url, 1)
  page_content <- read_html(url)
  total_pages <- page_content %>%
    html_element(".pagination li:last-child a") %>%
    html_attr("data-ci-pagination-page")
  return(as.integer(total_pages))
}

# Check the total number of pages
total_pages <- get_total_pages()

# Scrape prices for all pages
pages_to_scrape <- 1:total_pages
final_df <- map_dfr(pages_to_scrape, scrape_prices)
I tried opening the URL to inspect the HTML, but it only loads a logo and nothing else. So let me know if the structure has changed and I can adjust the custom function to help; otherwise, this should be a starting point for you.
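If the pagination markup can't be read at all (for example, if the `.pagination` selector or the `data-ci-pagination-page` attribute above returns `NA`), a selector-free fallback is to request pages one by one until a page comes back with no products. This is a minimal sketch reusing the `scrape_prices()` function from above; it assumes the site returns an empty product list (rather than an error page) once you go past the last page, and the `max_pages` cap is an arbitrary safety limit:

scrape_all_pages <- function(max_pages = 100) {
  results <- list()
  for (page in seq_len(max_pages)) {
    pc <- scrape_prices(page)
    if (nrow(pc) == 0) break  # empty page => we ran past the last one
    results[[page]] <- pc
  }
  bind_rows(results)
}

final_df <- scrape_all_pages()

This trades one extra HTTP request (the first empty page) for independence from the pagination HTML, which is useful when the structure varies between categories.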