Scraping a document in R

Question

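I am trying to download a Word document from the web page below. When you press the button, the Word document downloads automatically, and no download link is shown.

I am now trying to use XPath to download this document in R.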

library(rvest)

# send an HTTP GET request to the URL
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"
page <- read_html(url)

# try to locate the link to the Word document using an XPath expression
doc_link <- page %>%
  html_nodes(xpath='//*[@id="action_word_export"]')%>%
  html_attr("href")

Unfortunately this does not work and nothing is downloaded. Can anyone help me solve this and download the Word document from within R?
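The symptom can be reproduced offline with a small stand-in page. The markup below is hypothetical (the real page's HTML is not shown in the question, and `exportWord()` is an invented handler name). It only assumes that the export control is a button driven by JavaScript rather than an anchor with an `href`. It uses `html_element()`, which requires rvest 1.0.0 or later.

```r
library(rvest)

# Hypothetical markup, assumed for illustration: the button carries an
# onclick handler but no href attribute.
html <- '<html><body>
  <button id="action_word_export" onclick="exportWord()">Word</button>
</body></html>'
page <- read_html(html)

# The XPath expression does match the element ...
node <- html_element(page, xpath = '//*[@id="action_word_export"]')

# ... but there is no href attribute to read, so html_attr() yields NA.
href <- html_attr(node, "href")
print(href)
```

If the real page behaves the same way, the selector is not the problem: there is simply no URL in the static HTML for rvest to extract.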

Answer 1

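The problem is that the button triggers a JavaScript script that actually sends the download request, so there is no link associated with the button. If you are willing to use RSelenium, you can download the file as follows: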


library(RSelenium)

# Define the target URL
url <- "https://ec.europa.eu/taxation_customs/tedb/taxDetails.html?id=4205/1672527600"

# Start RSelenium --------------------------------------------------------------
# rsDriver() starts a Selenium server and returns a client that is already
# open, so a separate remDr$open() call is not needed (it would open a
# second browser session).
rD <- rsDriver(browser = "firefox", port = 4550L, chromever = NULL)
remDr <- rD[["client"]]

# Navigate to the web page -----------------------------------------------------
remDr$navigate(url)

# Click the export button ------------------------------------------------------
remDr$findElement(using = "xpath", value = '//*[@id="action_word_export"]')$clickElement()

# Give the download a moment to finish, then shut everything down --------------
Sys.sleep(5)
remDr$close()
rD[["server"]]$stop()
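Clicking the button only starts the download; the browser writes the file in the background. A minimal sketch of a polling helper in base R is shown below, so the script can wait for the file instead of guessing a delay. The function name `wait_for_download`, the file pattern, and the idea that the file lands in a known directory are all assumptions for illustration, not part of the original answer.

```r
# Hypothetical helper: poll `dir` until a non-empty file matching `pattern`
# appears, or give up after `timeout` seconds.
wait_for_download <- function(dir, pattern = "\\.docx?$",
                              timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    files <- list.files(dir, pattern = pattern, full.names = TRUE)
    # Skip empty placeholder files that may exist while a download is running.
    files <- files[file.size(files) > 0]
    if (length(files) > 0) {
      # Return the most recently modified match.
      return(files[which.max(file.mtime(files))])
    }
    if (Sys.time() >= deadline) {
      return(NA_character_)
    }
    Sys.sleep(interval)
  }
}
```

After the click, a call such as `wait_for_download("~/Downloads")` would return the path of the newest Word file, or `NA` on timeout. The directory is an assumption: it depends on how the browser profile is configured on your machine.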
