I am trying to use RSelenium and rvest to get all the links titled "Read More" from this page. The code I am using is as follows:
igop_get_links <- function(url = "https://igop.uab.cat/category/publicacions/"){
  site <- rvest::read_html(url)
  taula <- rvest::html_elements(site, ".paginated_content")
  text <- rvest::html_text(rvest::html_elements(taula, "a"))
  links <- rvest::html_attr(rvest::html_elements(taula, "a"), "href")
  df <- data.frame(text = text,
                   url = links)
  df <- df[df$text == "Read More",]
  return(df)
}
igop_get_pages <- function(url = "https://igop.uab.cat/category/publicacions"){
  links <- igop_get_links(url)
  # get max number of pages
  site <- rvest::read_html(url)
  max <- rvest::html_text(rvest::html_elements(site, ".pagination"))
  max <- strsplit(max, "\n\t\t\t\t")
  max <- sapply(max, function(x) gsub("\n|\t|\\.{3}", "", x), USE.NAMES = FALSE)
  max <- max(as.numeric(max[max != ""]))
  remDr <- RSelenium::rsDriver(
    remoteServerAddr = "localhost",
    port = 4445L,
    browser = "firefox",
    chromever = NULL,
    iedrver = NULL,
    phantomver = NULL
  )
  remDr <- remDr[["client"]]
  remDr$navigate(url)
  for(i in 1:(max-1)){
    webElem <- remDr$findElement(using = 'css selector', "a.next")
    webElem$clickElement()
    remDr$setTimeout(type = "page load", milliseconds = 10000)
    linkspage <- igop_get_links(remDr$getCurrentUrl()[[1]])
    links <- rbind(links, linkspage)
    # linkspage <- s |>
    #   rvest::session_follow_link(css = "a.next") |>
    #   igop_get_links()
    # links <- rbind(links, linkspage)
  }
  remDr$close()
  return(links)
}
However, when I try to run
t3 <- igop_get_pages()
one of the following three things happens, without my changing any code.
It crashes and returns the following error:
Selenium message:No active session with ID 87c316d8-ded8-41e7-94d7-4a119e4006c1
Error: Summary: NoSuchDriver
Detail: A session is either terminated or not started
Further Details: run errorDetails method
It crashes and shows the following message:
Could not open firefox browser.
Client error message:
Summary: UnknownError
Detail: An unknown server-side error occurred while processing the command.
Further Details: run errorDetails method
Check server log for further details.
Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE
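Or it does not throw any error but fails to navigate beyond the second page: it reads the first page, clicks the "Next" button, reads the second page, then goes back to the first page and repeats the process. This should not happen, because the "Previous" button has a different CSS selector (predictably, a.prev). I have tried using rvest::session_follow_link, but it does not work because the URL itself does not change (it is always https://igop.uab.cat/category/publicacions/# rather than https://igop.uab.cat/category/publicacions/2-3-whatever).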
I am using Firefox 118.0.2 on Windows.
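The content is updated through an Ajax call: a POST request is sent to admin-ajax.php, which returns the articles for the requested page number. You can find that call by inspecting the requests in the Network tab of your browser's developer tools, and you can imitate it yourself.

I would suggest handling this with just rvest and httr2 instead of RSelenium. You can copy the actual request from the browser's developer tools as cURL and pass it to httr2::curl_translate() to get translated httr2 code, which you can then adjust further, for example to test whether all of those headers are actually required and whether the request parameters can be modified. Apparently we can increase posts_per_page, and if we also set to_page to 1, we can get all 60 articles with a single request. posts_per_page does not have to match the actual number of articles; it was also tested with a value such as 100.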
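To illustrate the parameter tweaking described above, here is a minimal base R sketch that overrides individual key=value pairs in a urlencoded form body before it is sent. The helper name set_form_params and the shortened sample body (including its starting values to_page=2 and posts_per_page=12) are assumptions made up for illustration, not something taken from the site; the helper also does not URL-encode the new values, so it is only suitable for simple values such as numbers.

```r
# Hypothetical helper: override (or append) key=value pairs in a
# urlencoded form body. Values are pasted as-is, without URL-encoding.
set_form_params <- function(body, ...) {
  overrides <- list(...)
  pairs <- strsplit(body, "&", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", pairs)          # text before the first "="
  for (key in names(overrides)) {
    new_pair <- paste0(key, "=", overrides[[key]])
    hit <- keys == key
    if (any(hit)) {
      pairs[hit] <- new_pair              # replace existing parameter
    } else {
      pairs <- c(pairs, new_pair)         # append a new parameter
      keys <- c(keys, key)
    }
  }
  paste(pairs, collapse = "&")
}

# Shortened, illustrative body; the real one has many more fields.
body <- "action=extra_blog_feed_get_content&to_page=2&posts_per_page=12&order=desc"
new_body <- set_form_params(body, to_page = 1, posts_per_page = 100)
print(new_body)
```

The resulting string could then be passed to req_body_raw() in place of the hand-edited body.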
The following example extracts 3 links from every article: title, author, and comment count.
library(httr2)
library(rvest)
library(dplyr)
library(tidyr)
library(purrr)
# request the list of articles through WordPress admin-ajax.php;
# it is a POST call, so we'll use httr2;
# the call was extracted from the browser's dev tools as cURL, translated with
# httr2::curl_translate(), and a few parts were removed by trial and error;
# modified "to_page=1&posts_per_page=100" to control the returned article collection
request("https://igop.uab.cat/wp-admin/admin-ajax.php") %>%
req_body_raw("action=extra_blog_feed_get_content&et_load_builder_modules=1&blog_feed_nonce=7e1f0a6567&to_page=1&posts_per_page=100&order=desc&orderby=date&categories=226&show_featured_image=1&blog_feed_module_type=masonry&et_column_type=&show_author=1&show_categories=1&show_date=1&show_rating=1&show_more=1&show_comments=1&date_format=M+j%2C+Y&content_length=excerpt&hover_overlay_icon=&use_tax_query=1&tax_query%5B0%5D%5Btaxonomy%5D=category&tax_query%5B0%5D%5Bterms%5D%5B%5D=publications-en&tax_query%5B0%5D%5Bfield%5D=slug&tax_query%5B0%5D%5Boperator%5D=IN&tax_query%5B0%5D%5Binclude_children%5D=true", "application/x-www-form-urlencoded; charset=UTF-8") %>%
req_perform() %>%
resp_body_html() %>%
# extract article elements; returns an xml_nodeset that we can process as a list
html_elements("article") %>%
# extract title / author / comments elements from every article,
# we'll have a list of named lists of html_nodes
map(\(a) list(
title = html_element(a, ".post-title.entry-title a"),
author = html_element(a, ".vcard a[rel='author']"),
comments = html_element(a, ".vcard a.comments-link")
)) %>%
# apply a function to every html_node in our list (60 x 3) to extract href and text
map_depth(2, \(a) list(url = html_attr(a, "href"),
text = html_text(a) %>% trimws())) %>%
# current item structure looks like this:
# $ :List of 3
# ..$ title :List of 2
# .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/"
# .. ..$ text: chr "El arte de pactar"
# ..$ author :List of 2
# .. ..$ url : chr "https://igop.uab.cat/author/igop/"
# .. ..$ text: chr "IGOP"
# ..$ comments:List of 2
# .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/#comments"
# .. ..$ text: chr "0"
# rbind the list and convert it to a tibble of 3 nested columns (title, author, comments),
# each column includes url & text
do.call(rbind, args = .) %>% as.data.frame() %>%
as_tibble() %>%
# unnest to get 6 columns
unnest_wider(everything(), names_sep = ".")
Result:
#> # A tibble: 60 × 6
#> title.url title.text author.url author.text comments.url comments.text
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 https://igop.ua… El arte d… https://i… IGOP https://igo… 0
#> 2 https://igop.ua… Intersect… https://i… IGOP https://igo… 0
#> 3 https://igop.ua… EU agenci… https://i… IGOP https://igo… 0
#> 4 https://igop.ua… El apoyo … https://i… IGOP https://igo… 0
#> 5 https://igop.ua… The doubl… https://i… IGOP https://igo… 0
#> 6 https://igop.ua… Evaluatin… https://i… IGOP https://igo… 0
#> 7 https://igop.ua… Residenci… https://i… IGOP https://igo… 0
#> 8 https://igop.ua… Governmen… https://i… IGOP https://igo… 0
#> 9 https://igop.ua… Beyond re… https://i… IGOP https://igo… 0
#> 10 https://igop.ua… The emerg… https://i… IGOP https://igo… 0
#> # ℹ 50 more rows
Created on 2023-10-24 with reprex v2.0.2
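To try the reshaping steps without hitting the site, the same extraction can be run on a small hand-written HTML fragment. This is only a sketch: the markup below is made up to resemble the selectors used above (the example.org URLs and post names are placeholders), and the real response may be structured differently.

```r
library(rvest)
library(purrr)
library(tidyr)

# Made-up fragment with two "articles"; not the site's real markup.
toy <- minimal_html('
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/a/">First post</a></h2>
    <span class="vcard">
      <a rel="author" href="https://example.org/author/x/">X</a>
      <a class="comments-link" href="https://example.org/a/#comments"> 0 </a>
    </span>
  </article>
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/b/">Second post</a></h2>
    <span class="vcard">
      <a rel="author" href="https://example.org/author/y/">Y</a>
      <a class="comments-link" href="https://example.org/b/#comments"> 3 </a>
    </span>
  </article>
')

# One named list of three nodes per article.
nodes <- map(html_elements(toy, "article"), \(a) list(
  title    = html_element(a, ".post-title.entry-title a"),
  author   = html_element(a, ".vcard a[rel='author']"),
  comments = html_element(a, ".vcard a.comments-link")
))

# Replace every node with its href and trimmed text.
fields <- map_depth(nodes, 2, \(n) list(url  = html_attr(n, "href"),
                                        text = trimws(html_text(n))))

# Bind to a list-matrix, convert to a tibble, then widen to 6 columns.
tbl <- do.call(rbind, fields) |>
  as.data.frame() |>
  as_tibble() |>
  unnest_wider(everything(), names_sep = ".")

print(tbl)
```

Working on a fixed fragment like this makes it easier to check the selectors and the final column layout before pointing the code at the live admin-ajax.php response.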