I have the following link: https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=202112&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266 and I want to scrape it with R. The goal is to replace the final digits of the "202112" part with "01", "02", "03", and so on up to "12", and then download each page's data by automatically pressing the download button. I have the following code, but it throws the error shown below it.
library(RSelenium)

download_senamhi_data <- function(url_list) {
  # Pick a random port number
  port <- as.integer(runif(1, min = 5000, max = 6000))
  # Start the Google Chrome driver
  rD <- rsDriver(port = port, browser = "chrome",
                 chromever = "101.0.4951.15")
  remDrv <- rD$client
  for (url in url_list) {
    # Navigate to the URL
    remDrv$navigate(url)
    # Find the download button and click it
    down_button <- remDrv$findElement(using = "id", "export2")
    down_button$clickElement()
  }
  # Close the current session
  remDrv$close()
  rD$server$stop()
  rm(rD, remDrv)
  gc()
}
# Build the URL list and run the function to download every month of the year
list_url <- list()
for (i in 1:9) {
  list_url[i] <- paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=20210",
                       i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
}
for (i in 10:12) {
  list_url[i] <- paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=2021",
                       i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
}
download_senamhi_data(list_url)
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
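As a side note on the error above: a common way to sidestep type issues like this is to build the URLs as a plain character vector (rather than a list), so each loop iteration hands `navigate()` a single length-one string. A minimal sketch, assuming only the URL construction needs to change (the station and filter parameters are copied from the question's URL):

```r
# Zero-pad the month so one vectorised call replaces both for loops;
# sprintf("%02d", 1:12) yields "01", "02", ..., "12"
url_vec <- sprintf(
  paste0("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php",
         "?estaciones=109045&CBOFiltro=2021%02d",
         "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266"),
  1:12
)
# Each element is now a single character string:
# url_vec[1] contains "CBOFiltro=202101", url_vec[12] contains "CBOFiltro=202112"
# download_senamhi_data(url_vec)   # then pass the vector to the original function
```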
If {RSelenium} is not a strict requirement here, we can use {rvest} to extract those tables from the list of URLs. The site's CSV export is not much better anyway: it uses client-side JavaScript to convert the HTML table to CSV. Here I use purrr::map to iterate over the list instead of a for loop:
library(rvest)
library(dplyr)
library(purrr)
# build a vector of 12 urls
urls <- paste0("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php",
"?estaciones=109045&CBOFiltro=2021", sprintf("%.2d", 1:12),
"&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266")
# read content of all urls,
# from each page extract dataTable,
# parse table content,
# bind list of tibbles (one per month) into one,
# filter out header rows (first column is not a date string),
# set column names,
# convert date strings to dates and measurements to numeric
df <- urls %>%
  map(read_html, .progress = TRUE) %>%
  map(html_element, "table#dataTable") %>%
  map(html_table) %>%
  bind_rows() %>%
  filter(grepl("^\\d{4}-\\d{2}-\\d{2}$", X1)) %>%
  set_names(c("date", "temp_max", "temp_min", "hum_rel", "prec")) %>%
  mutate(date = lubridate::ymd(date)) %>%
  mutate(across(temp_max:prec, as.numeric))
# save as csv:
readr::write_csv(df, "out.csv")
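If one file per month is preferable to a single out.csv, the combined tibble can be split by year-month and written piecewise. A sketch using purrr::iwalk() — the mock df and the "senamhi_" file-name prefix below are my own assumptions, standing in for the scraped data:

```r
library(purrr)

# Mock data with the same columns as the scraped `df` above
df <- data.frame(
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-02-01")),
  temp_max = NA_real_, temp_min = NA_real_,
  hum_rel = NA_real_, prec = c(0, 1.2, 0.5)
)

# split() names each piece by its year-month code; iwalk() passes
# the piece as .x and its name as .y, so the name becomes the file name
df %>%
  split(format(df$date, "%Y%m")) %>%
  iwalk(~ readr::write_csv(.x, paste0("senamhi_", .y, ".csv")))
```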
The full-year, 365-row dataset:
df
#> # A tibble: 365 × 5
#> date temp_max temp_min hum_rel prec
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2021-01-01 NA NA NA 0
#> 2 2021-01-02 NA NA NA 1.2
#> 3 2021-01-03 NA NA NA 1.9
#> 4 2021-01-04 NA NA NA 0
#> 5 2021-01-05 NA NA NA 6
#> 6 2021-01-06 NA NA NA 1.9
#> 7 2021-01-07 NA NA NA 4.2
#> 8 2021-01-08 NA NA NA 2.5
#> 9 2021-01-09 NA NA NA 0
#> 10 2021-01-10 NA NA NA 0
#> # ℹ 355 more rows
Created on 2023-09-22 with reprex v2.0.2