Web scraping across multiple pages with rvest

Question

I am trying to scrape information about European renewable energy manufacturers, suppliers, and companies from the following site: https://www.energy-xprt.com/renewable-energy/companies/location-europe/

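The first step is to collect the URL of every company in the listing, but when I run a loop to scrape across pages, I only get the company links from the first page. My code looks like this: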

link <- paste0('https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-',1:78)  
result <- lapply(link, function(x) x %>% 
                   read_html %>% html_nodes("[class='h2 mb-0']") %>% html_elements('a') %>% html_attr('href')
                 ) %>% unlist() %>% unique()

I expect to get a vector containing the company URLs from all 78 pages.

r web-scraping rvest
1 Answer
library(tidyverse)
library(httr2)
library(rvest)

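# Build one URL per results page (1 to 78), turn each into an httr2 request,
# perform all requests in parallel, and parse every response body as HTML.
data <- str_c(
    "https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-",
    1:78
  ) %>%
  map(request) %>%
  req_perform_parallel() %>%
  map(resp_body_html)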

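# For each parsed page, take every company card (".product-item") and
# build a one-row tibble from it; map_dfr() row-binds the results.
data %>%
  map_dfr( ~ html_elements(.x, ".product-item") %>%
             map_dfr(
               ~ tibble(
                 # Company name shown in the card heading
                 company_name = html_element(.x, ".h2.mb-0") %>%
                   html_text2(),
                 # Company type, e.g. "Manufacturer", with whitespace tidied
                 type = html_element(.x, ".product-supplier-name") %>%
                   html_text2() %>%
                   str_remove_all("\r") %>%
                   str_squish(),
                 # Location text with the " based in " label and line breaks removed
                 location = html_element(.x, ".col.ps-0") %>%
                   html_text2() %>%
                   str_remove_all(" based in ") %>%
                   str_remove_all("\r|\n"),
                 # URL of the company page, taken from the heading link
                 link = html_element(.x, ".h2.mb-0 a") %>%
                   html_attr("href")
               )
             ))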

# A tibble: 1,560 × 4
   company_name                           type             location                       link                  
   <chr>                                  <chr>            <chr>                          <chr>                 
 1 Lindner-Recyclingtech GmbH             Manufacturer     Spittal/Drau, AUSTRIA          https://www.energy-xp…
 2 OHMSETT                                Service provider Leonardo, NEW JERSEY (USA)     https://www.energy-xp…
 3 Excalibur Water Systems Inc.           Manufacturer     Barrie, ONTARIO (CANADA)       https://www.energy-xp…
 4 Zygo Corporation - AMETEK, Inc         Manufacturer     Middlefield, CONNECTICUT (USA) https://www.energy-xp…
 5 Real Tech Inc.                         Manufacturer     Whitby, ONTARIO (CANADA)       https://www.energy-xp…
 6 Proco Products, Inc.                   Manufacturer     Stockton, CALIFORNIA (USA)     https://www.energy-xp…
 7 Energy Systems & Design (ES&D)         Manufacturer     Sussex, NEW BRUNSWICK (CANADA) https://www.energy-xp…
 8 HRS Heat Exchangers Ltd.               Manufacturer     Watford, UNITED KINGDOM        https://www.energy-xp…
 9 Samyang Corporation                    Manufacturer     Jongno-gu, SOUTH KOREA         https://www.energy-xp…
10 Arthur Freedman Associates, Inc. (AFA) Consulting firm  Dyer, INDIANA (USA)            https://www.energy-xp…
# ℹ 1,550 more rows
# ℹ Use `print(n = ...)` to see more rows
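
The question asks for a plain vector of company URLs rather than a tibble. Below is a minimal offline sketch of that last step, assuming each page uses the same `.product-item` and `.h2.mb-0 a` structure targeted by the answer's selectors. The inline HTML, the `example.com` URLs, and the `extract_links()` helper are hypothetical stand-ins for illustration, not the real site markup or part of rvest.

```r
library(rvest)

# Hypothetical stand-in for one downloaded results page; the real markup may differ.
page <- minimal_html('
  <div class="product-item">
    <div class="h2 mb-0"><a href="https://example.com/company-a">Company A</a></div>
  </div>
  <div class="product-item">
    <div class="h2 mb-0"><a href="https://example.com/company-b">Company B</a></div>
  </div>
  <div class="product-item">
    <div class="h2 mb-0"><a href="https://example.com/company-a">Company A</a></div>
  </div>
')

# Illustrative helper (not an rvest function): return every company link on one page.
extract_links <- function(doc) {
  doc %>%
    html_elements(".product-item .h2.mb-0 a") %>%
    html_attr("href")
}

# Pretend two pages were downloaded, then flatten to one de-duplicated character vector.
pages <- list(page, page)
links <- pages %>%
  lapply(extract_links) %>%
  unlist() %>%
  unique()

links
```

With the list of parsed pages from the answer (`data`), the same helper could be applied in place of `pages`. If the tibble has already been built, taking its `link` column and passing it through `unique()` should give an equivalent vector.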