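I am trying to scrape information about European renewable energy manufacturers, suppliers, and companies from the following website: https://www.energy-xprt.com/renewable-energy/companies/location-europe/.
The first step is to collect the URL of every company in the list, but when I run a loop to scrape across pages, I only get the company links from the first page. My code looks like this: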
link <- paste0('https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-', 1:78)
result <- lapply(link, function(x) x %>%
  read_html %>%
  html_nodes("[class='h2 mb-0']") %>%
  html_elements('a') %>%
  html_attr('href')
) %>% unlist() %>% unique()
I expect to get a vector containing the company URLs from all 78 pages.
library(tidyverse)
library(httr2)
library(rvest)

# Build one request per listing page, fetch them in parallel, and parse each response as HTML
data <- str_c(
  "https://www.energy-xprt.com/renewable-energy/companies/location-europe/page-",
  1:78
) %>%
  map(request) %>%
  req_perform_parallel() %>%
  map(resp_body_html)

# For every page, build one row per ".product-item" card
data %>%
  map_dfr(~ html_elements(.x, ".product-item") %>%
    map_dfr(
      ~ tibble(
        company_name = html_element(.x, ".h2.mb-0") %>%
          html_text2(),
        type = html_element(.x, ".product-supplier-name") %>%
          html_text2() %>%
          str_remove_all("\r") %>%
          str_squish(),
        location = html_element(.x, ".col.ps-0") %>%
          html_text2() %>%
          str_remove_all(" based in ") %>%
          str_remove_all("\r|\n"),
        # The company URL is the href of the anchor inside the heading
        link = html_element(.x, ".h2.mb-0 a") %>%
          html_attr("href")
      )
    ))

# A tibble: 1,560 × 4
   company_name                           type             location                       link
   <chr>                                  <chr>            <chr>                          <chr>
 1 Lindner-Recyclingtech GmbH             Manufacturer     Spittal/Drau, AUSTRIA          https://www.energy-xp…
 2 OHMSETT                                Service provider Leonardo, NEW JERSEY (USA)     https://www.energy-xp…
 3 Excalibur Water Systems Inc.           Manufacturer     Barrie, ONTARIO (CANADA)       https://www.energy-xp…
 4 Zygo Corporation - AMETEK, Inc         Manufacturer     Middlefield, CONNECTICUT (USA) https://www.energy-xp…
 5 Real Tech Inc.                         Manufacturer     Whitby, ONTARIO (CANADA)       https://www.energy-xp…
 6 Proco Products, Inc.                   Manufacturer     Stockton, CALIFORNIA (USA)     https://www.energy-xp…
 7 Energy Systems & Design (ES&D)         Manufacturer     Sussex, NEW BRUNSWICK (CANADA) https://www.energy-xp…
 8 HRS Heat Exchangers Ltd.               Manufacturer     Watford, UNITED KINGDOM        https://www.energy-xp…
 9 Samyang Corporation                    Manufacturer     Jongno-gu, SOUTH KOREA         https://www.energy-xp…
10 Arthur Freedman Associates, Inc. (AFA) Consulting firm  Dyer, INDIANA (USA)            https://www.energy-xp…
# ℹ 1,550 more rows
# ℹ Use `print(n = ...)` to see more rows
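
Since the question asks for a plain vector of company URLs rather than a table, the same `.h2.mb-0 a` selector can probably be applied across the parsed pages and the results flattened. The following is only a minimal offline sketch: the two pages are made-up HTML snippets built with `rvest::minimal_html()`, and the `example.com` URLs and the names `pages` and `company_links` are assumptions for illustration, not taken from the site.

```r
library(rvest)
library(purrr)

# Hypothetical stand-ins for the parsed pages; the markup only mimics
# the ".product-item" / ".h2.mb-0 a" structure that the selectors target.
pages <- list(
  minimal_html('<div class="product-item"><div class="h2 mb-0"><a href="https://example.com/company-a">A</a></div></div>'),
  minimal_html('<div class="product-item"><div class="h2 mb-0"><a href="https://example.com/company-b">B</a></div></div>
                <div class="product-item"><div class="h2 mb-0"><a href="https://example.com/company-a">A</a></div></div>')
)

# Collect every company href on every page, then flatten and de-duplicate
company_links <- pages %>%
  map(~ html_elements(.x, ".h2.mb-0 a") %>% html_attr("href")) %>%
  unlist() %>%
  unique()

print(company_links)
```

With the real pages, the list of parsed documents would take the place of `pages`; whether the de-duplicated result covers all 78 pages depends on the site actually returning different content for each page URL.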