Error "length(url) == 1 is not TRUE" when web scraping


I have the following link: https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=202112&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266 and I want to scrape it with R. The goal is to replace the last two digits of the "202112" part with "01", "02", "03", and so on up to 12, and then download each page's data by automatically pressing the export button. I have the following code, but I get the error shown below.

library(RSelenium)

download_senamhi_data <- function(url_list) {
  
  # Pick a random port number
  port <- as.integer(runif(1, min = 5000, max = 6000))
  
  # Start the Google Chrome driver
  rD <- rsDriver(port = port, browser = "chrome", 
                 chromever = "101.0.4951.15")
  
  remDrv <- rD$client
  
  for (url in url_list) {
    
    # Navigate to the URL
    remDrv$navigate(url)
    
    # Find the download button and click it
    down_button <- remDrv$findElement(using = "id", "export2")
    down_button$clickElement()
    
  }
  
  # Close the current session
  remDrv$close()
  rD$server$stop()
  rm(rD, remDrv)
  gc()

}

# Run the function to download every month of one year

list_url <- list()

for (i in 1:9) {
  
  list_url[i] = paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=20210",
               i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
  
 }

for (i in 10:12) {
  
  list_url[i] = paste("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=109045&CBOFiltro=2021",
               i, "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266", sep = "")
  
}

download_senamhi_data(list_url)

Error in checkError(res) : 
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
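
One likely trigger for `length(url) == 1 is not TRUE` is that `remDrv$navigate()` ends up receiving something other than a single character string. Building the URLs as a plain character vector (rather than a `list()` filled element by element) rules that out, and `sprintf()` with `%02d` also removes the need for two separate loops. A minimal sketch, assuming the rest of the RSelenium setup stays as above:

```r
# Build all 12 URLs as a character vector (not a list),
# zero-padding the month so 1 becomes "01", ..., 12 stays "12".
base <- "https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php"
urls <- sprintf(
  paste0(base,
         "?estaciones=109045&CBOFiltro=2021%02d",
         "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266"),
  1:12
)

# Each element is now a length-1 character string, which is what
# remDrv$navigate() expects inside the loop:
# for (url in urls) remDrv$navigate(url)
```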
1 Answer

If {RSelenium} is not a strict requirement here, we can use {rvest} to extract those tables from the list of URLs. The site's CSV export is no better anyway: it uses client-side JavaScript to convert the HTML table to CSV.

Here I use purrr::map to iterate over the list instead of a for loop:

library(rvest)
library(dplyr)
library(purrr)

# build a vector of 12 urls
urls <- paste0("https://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php",
               "?estaciones=109045&CBOFiltro=2021", sprintf("%.2d", 1:12),
               "&t_e=M&estado=DIFERIDO&cod_old=154107&cate_esta=PLU&alt=2266")

# read content of all urls,
# from each page extract dataTable,
# parse table content,
# bind list of tibbles (1 per each month) into one,
# filter out header rows (first column is not a date string),
# set column names,
# convert date strings to dates and measurements to numeric

df <- urls %>% 
  map(read_html, .progress = TRUE) %>% 
  map(html_element, "table#dataTable") %>% 
  map(html_table) %>% 
  bind_rows() %>% 
  filter(grepl("^\\d{4}-\\d{2}-\\d{2}$",X1)) %>% 
  set_names(c("date", "temp_max", "temp_min", "hum_rel", "prec")) %>% 
  mutate(date = lubridate::ymd(date)) %>% 
  mutate(across(temp_max:prec, as.numeric))

# save as csv:
readr::write_csv(df, "out.csv")

A 365-row dataset covering the whole year:

df
#> # A tibble: 365 × 5
#>    date       temp_max temp_min hum_rel  prec
#>    <date>        <dbl>    <dbl>   <dbl> <dbl>
#>  1 2021-01-01       NA       NA      NA   0  
#>  2 2021-01-02       NA       NA      NA   1.2
#>  3 2021-01-03       NA       NA      NA   1.9
#>  4 2021-01-04       NA       NA      NA   0  
#>  5 2021-01-05       NA       NA      NA   6  
#>  6 2021-01-06       NA       NA      NA   1.9
#>  7 2021-01-07       NA       NA      NA   4.2
#>  8 2021-01-08       NA       NA      NA   2.5
#>  9 2021-01-09       NA       NA      NA   0  
#> 10 2021-01-10       NA       NA      NA   0  
#> # ℹ 355 more rows

Created on 2023-09-22 with reprex v2.0.2
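
Since this station only reports precipitation (the temperature and humidity columns are all NA in the output above), a natural follow-up is to aggregate `prec` by month. The sketch below uses a small toy stand-in for `df` so it runs on its own; with the real pipeline result, drop the toy tibble and reuse the `date` and `prec` columns produced by `set_names()`:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for df (the real one comes from the rvest pipeline above)
df <- tibble(
  date = as.Date(c("2021-01-01", "2021-01-02", "2021-02-01")),
  prec = c(0, 1.2, 1.9)
)

# Monthly precipitation totals, ignoring missing values
monthly <- df %>%
  mutate(month = floor_date(date, "month")) %>%
  group_by(month) %>%
  summarise(prec_total = sum(prec, na.rm = TRUE), .groups = "drop")
```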
