Downloading files from a URL in R

Question

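I have a list of URLs in the following format: https://erc.undp.org/evaluation/evaluations/detail/7834

I need to download the .pdf and .docx files. A user has to click on them to download the files. The hyperlinks are also not visible on the page. I tried to get the xpath, but it does not return a value. I have also tried various other approaches, including: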

library(rvest)
# Get the HTML of the webpage
html <- read_html("https://erc.undp.org/evaluation/evaluations/detail/7834?tab=documents")
# Find all the links on the webpage that point to downloadable documents
links <- html %>%
  html_nodes("a") %>%
  html_attr("href")

Is there any way to get the list of URLs for the documents available on the page above? Once I have the list of URLs, I can use download.file() to download them.

Answer 1

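For pages like this, you can often find the data in one of the javascript sections of the page, but you cannot get at it with the usual rvest methods. However, if you convert the whole page to text, you can use (for example) stringr functions to isolate the parts you need.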

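You will need to use something like Chrome's inspect feature to understand what is going on, and it can be hard to find. Also, try doing the download manually once in the browser and note how the URLs are formed.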

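In this case, if you inspect the page in Chrome and search under "Sources", you can find the documentId buried in the javascript. Having done the download manually in the browser, I found that each document has a simple download page based on this documentId. This worked for me...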

library(tidyverse)
library(rvest)
    
links <- read_html("https://erc.undp.org/evaluation/evaluations/detail/7834?tab=documents") %>% 
  as.character() %>% #save whole page as character
  str_extract_all("\\{.+\\}") %>% #data is in { }
  pluck(1) %>% #unlist
  tail(1) %>%  #we want the last element
  str_extract_all("documentId.+?\\,") %>% #look for documentIds
  pluck(1) %>% #unlist
  str_extract("\\d+") %>% #extract numeric Id
  paste0("https://erc.undp.org/evaluation/documents/download/", .) #create URL

links
[1] "https://erc.undp.org/evaluation/documents/download/18959"
[2] "https://erc.undp.org/evaluation/documents/download/18971"
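
To see what each regex step in the pipeline above does without hitting the network, here is a minimal sketch run against a made-up page fragment. The page_text string and its "documents" and "title" fields are assumptions for illustration only; the real page source is longer and may be laid out differently.

```r
library(stringr)

# Hypothetical stand-in for the page source (field names other than
# documentId are invented for this example).
page_text <- '<script>var data = {"documents":[{"documentId":18959,"title":"Report"},{"documentId":18971,"title":"Annex"}]};</script>'

# Step 1: grab the {...} block that holds the data (greedy match, as in the answer).
json_like <- str_extract_all(page_text, "\\{.+\\}")[[1]]

# Step 2: take the last block and find each "documentId...," fragment (lazy match).
id_fragments <- str_extract_all(tail(json_like, 1), "documentId.+?\\,")[[1]]

# Step 3: keep only the digits and build the download URLs.
doc_ids <- str_extract(id_fragments, "\\d+")
demo_links <- paste0("https://erc.undp.org/evaluation/documents/download/", doc_ids)
print(demo_links)
```

The greedy .+ in step 1 relies on the data sitting on a single line, because . does not match a newline by default. If the site changes its layout, this is probably the first step to check.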

You can then pass these to download.file, or do whatever else you want with them.
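
For the download step, something along these lines might work. It is only a sketch: dest_for and download_all are hypothetical helper names, the "downloads" folder is an assumption, and because the download URL does not reveal whether a file is a .pdf or a .docx, the files are saved under the bare documentId with no extension.

```r
# Links as printed in the answer's output.
links <- c(
  "https://erc.undp.org/evaluation/documents/download/18959",
  "https://erc.undp.org/evaluation/documents/download/18971"
)

# Hypothetical helper: name each local file after the documentId at the end of the URL.
# No extension is added because the URL does not show the file type.
dest_for <- function(url, dir = "downloads") {
  file.path(dir, basename(url))
}

# Hypothetical helper: download every link, carrying on if one of them fails.
download_all <- function(urls, dir = "downloads") {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  for (u in urls) {
    tryCatch(
      # mode = "wb" keeps binary files (pdf, docx) intact on Windows
      download.file(u, destfile = dest_for(u, dir), mode = "wb"),
      error = function(e) message("Failed: ", u, " (", conditionMessage(e), ")")
    )
  }
  invisible(dest_for(urls, dir))
}

# download_all(links)  # uncomment to actually fetch the files
```

With a list of evaluation pages, as in the question, the same loop could be run over the links collected from each page.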
