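I have the following website: https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod and I want to download all the files from 2021 through 2023. On the site you can choose between different folders, but for now I only want to focus on the 2023 folder and download every file in it.
I tried using a loop and the rvest package, without success. I would like to download every file in the 2023 folder, but I can't work out the code. Please help.
Additional information:
The code I used is very basic, since I have only just started doing more complex tasks in R. This is what I tried: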
library(rvest) # provides read_html(), html_nodes(), html_attr() and %>%
IOD <- read_html("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F")
urls <- IOD %>%
html_nodes('context-menu-list-context-menu-root') %>% # get all `area` nodes
html_attr('href') %>% # get the link attribute of each node
sub('.htm$', '.zip', .) %>% # change file suffix
paste0('https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod', .) # append to base URL
# create a directory for it all
dir <- file.path(tempdir(), 'COES')
dir.create(dir)
lapply(urls, function(url) download.file(url, file.path(dir, basename(url))))
# check it's there
list.files(dir)
After running that code, the output is:
IOD <- read_html("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F")
urls <- IOD %>%
html_nodes('context-menu-list-context-menu-root') %>%
# get all `area` nodes
html_attr('href') %>%
# get the link attribute of each node
sub('.htm$', '.zip', .) %>%
# change file suffix
paste0('https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod', .) # append to base URL
# create a directory for it all
dir <- file.path(tempdir(), 'COES')
dir.create(dir)
# Warning message:
# In dir.create(dir) :
# 'C:\Users\RCV\AppData\Local\Temp\Rtmp0AUH8C\COES' already exists
lapply(urls, function(url) download.file(url, file.path(dir, basename(url))))
# probando la URL 'https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod'
# Content type 'text/html; charset=utf-8' length 48185 bytes (47 KB)
# downloaded 47 KB
# [[1]]
# [1] 0
# check it's there
list.files(dir)
# [1] "Ieod" "Ieod#"
To be honest, I don't really know how to do this. Sorry if this is a basic question.
My attempt with rvest::read_html_live():
First, we have to get the link for each month:
url <- "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post%20Operaci%C3%B3n%2FReportes%2FIEOD%2F2023%2F"
ses <- rvest::read_html_live(url)
# ses$view()
months <- ses |>
rvest::html_elements(xpath = "//li[contains(@id, \"Post Operación/Reportes/IEOD/2023\")]") |>
rvest::html_attr("id")
months <- paste0("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=", months)
months[[1]]
# [1] "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post Operación/Reportes/IEOD/2023/12_Diciembre/"
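Now, for each month, you have to get the daily links (the following is an example for a single month only; you should extend it with lapply or another form of iteration):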
ses <- rvest::read_html_live(months[[1]])
days <- ses |>
rvest::html_elements(xpath = "//a[contains(@id, \"Post Operación/Reportes/IEOD/2023/\")]") |>
rvest::html_attr("id")
daily_urls <- paste0("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=", days)
daily_urls[[1]]
# [1] "https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/"
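As a rough sketch of how the single-month example above might be extended to every month with lapply: the helper names get_ids and collect_day_ids, the reader argument and the pause argument are hypothetical and not part of rvest. The sketch assumes each month page exposes the day links under the same XPath as the December example, and it waits briefly between pages to go easy on the server.

```r
# XPath for the day links, copied from the single-month example above
day_xpath <- '//a[contains(@id, "Post Operación/Reportes/IEOD/2023/")]'

# Hypothetical helper: open one page and return the `id` attributes matching an XPath.
# `reader` defaults to rvest::read_html_live; any function that takes a URL and
# returns a parsed page can be swapped in (handy for offline checks).
get_ids <- function(url, xpath, reader = rvest::read_html_live) {
  page <- reader(url)
  nodes <- rvest::html_elements(page, xpath = xpath)
  rvest::html_attr(nodes, "id")
}

# Hypothetical helper: repeat get_ids() over all month URLs and pool the day ids.
collect_day_ids <- function(month_urls, xpath = day_xpath,
                            reader = rvest::read_html_live, pause = 1) {
  ids <- lapply(month_urls, function(u) {
    Sys.sleep(pause) # assumed courtesy delay between page loads
    get_ids(u, xpath, reader)
  })
  unique(as.character(unlist(ids)))
}

# Usage (opens a headless browser session per month, so it is left commented out):
# days_all <- collect_day_ids(months)
# daily_urls_all <- paste0("https://www.coes.org.pe/Portal/PostOperacion/Reportes/Ieod?path=", days_all)
```

Each call to rvest::read_html_live starts a live page, so running this over twelve months (and later over every day) can be slow.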
Now we have a link to a specific day (31) within that month (December). We have to extract the table from that page, for example:
ses <- rvest::read_html_live(daily_urls[[1]])
t <- ses |>
rvest::html_elements(xpath = "//*[@id=\"tbDocumentLibrary\"]") |>
rvest::html_table() |>
purrr::pluck(1)
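Because this step has to be repeated for every day, it may help to wrap it in a function that copes with a day page that has no table. The following is only a sketch: list_day_files and its reader argument are hypothetical names, and it assumes the table keeps the id tbDocumentLibrary and a Nombre column, as in the example above.

```r
# Hypothetical helper: return the file names listed on one day page.
# Returns character(0) when the page has no document table or no `Nombre` column.
list_day_files <- function(day_url, reader = rvest::read_html_live) {
  page <- reader(day_url)
  nodes <- rvest::html_elements(page, xpath = '//*[@id="tbDocumentLibrary"]')
  tables <- rvest::html_table(nodes) # one data frame per matched table
  if (length(tables) == 0 || !"Nombre" %in% names(tables[[1]])) {
    return(character(0))
  }
  as.character(tables[[1]]$Nombre)
}

# Usage (needs a live browser session, so it is left commented out):
# files <- list_day_files(daily_urls[[1]])
```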
And build the URLs of the individual files:
paste0("https://www.coes.org.pe/portal/browser/download?url=", days[[1]], t$Nombre)
# [1] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Dom_3112.pdf"
# [2] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo6_CMgCP_3112.zip"
# [3] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo5_Manttoejec_3112.xls"
# [4] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo4_Hop_3112.xlsx"
# [5] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo3_RPFyRSF_3112.xlsx"
# [6] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo2_Hidrologia_3112.xlsx"
# [7] "https://www.coes.org.pe/portal/browser/download?url=Post Operación/Reportes/IEOD/2023/12_Diciembre/31/Anexo1_Resumen_3112.xlsx"
Note that this requires another iteration. There are 4 levels of iteration in total: 1/year, 2/month, 3/day, 4/individual file.
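The innermost level (individual files) could be sketched roughly as follows. The names build_download_urls and download_day are hypothetical. The sketch assumes the download endpoint shown above accepts a plain GET request without cookies or extra headers, which I have not verified. It percent-encodes the space and the "ó" in the path with utils::URLencode, and it writes in binary mode so the xlsx, zip and pdf files are not corrupted on Windows.

```r
base_download <- "https://www.coes.org.pe/portal/browser/download?url="

# Hypothetical helper: build percent-encoded download URLs for one day.
# `day_id` is a folder id such as days[[1]]; `file_names` is e.g. t$Nombre.
build_download_urls <- function(day_id, file_names) {
  if (length(file_names) == 0) return(character(0))
  raw <- paste0(base_download, day_id, file_names)
  # enc2utf8() so that non-ASCII characters are encoded as UTF-8 bytes
  vapply(raw, function(u) utils::URLencode(enc2utf8(u)),
         character(1), USE.NAMES = FALSE)
}

# Hypothetical helper: download every file of one day into `dest_dir`.
download_day <- function(day_id, file_names, dest_dir, pause = 1) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  urls <- build_download_urls(day_id, file_names)
  dest <- file.path(dest_dir, file_names)
  for (i in seq_along(urls)) {
    tryCatch(
      utils::download.file(urls[i], dest[i], mode = "wb", quiet = TRUE),
      error = function(e) message("Failed: ", urls[i]) # keep going on errors
    )
    Sys.sleep(pause) # assumed courtesy delay between requests
  }
  invisible(dest)
}

# Usage for the day shown above (performs real downloads, so it is left commented out):
# download_day(days[[1]], t$Nombre, file.path(tempdir(), "COES", "2023-12-31"))
```

Wrapping download_day in a loop over all day ids, and that loop in one over the months and years, gives the four levels of iteration mentioned above.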