RSelenium does not move to the third page, or crashes with "No active session with ID" or an unknown server-side error

Question

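I am trying to use RSelenium and rvest to get all the links titled "Read More" from this page (https://igop.uab.cat/category/publicacions/, the default URL in the code below).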

The code I am using is as follows:

igop_get_links <- function(url = "https://igop.uab.cat/category/publicacions/"){
  site <- rvest::read_html(url)
  taula <- rvest::html_elements(site, ".paginated_content")
  text <- rvest::html_text(rvest::html_elements(taula, "a"))
  links <- rvest::html_attr(rvest::html_elements(taula, "a"), "href")
  df <- data.frame(text = text,
                   url = links)
  df <- df[df$text== "Read More",]
  return(df)
}

igop_get_pages <- function(url = "https://igop.uab.cat/category/publicacions"){
  links <- igop_get_links(url)
  # get max number of pages
  site <- rvest::read_html(url)
  max <- rvest::html_text(rvest::html_elements(site, ".pagination"))
  max <- strsplit(max, "\n\t\t\t\t")
  max <- sapply(max, function(x) gsub("\n|\t|\\.{3}", "", x), USE.NAMES = FALSE)
  max <- max(as.numeric(max[max != ""]))
  remDr <- RSelenium::rsDriver(
    remoteServerAddr = "localhost",
    port = 4445L,
    browser = "firefox",
    chromever = NULL,
    iedrver = NULL,
    phantomver = NULL
  )
  remDr <- remDr[["client"]]
  remDr$navigate(url)
  for(i in 1:(max-1)){
    webElem <- remDr$findElement(using = 'css selector',"a.next")
    webElem$clickElement()
    remDr$setTimeout(type = "page load", milliseconds = 10000)
    linkspage <- igop_get_links(remDr$getCurrentUrl()[[1]])
    links <- rbind(links, linkspage)
    # linkspage <- s |>
    #   rvest::session_follow_link(css = "a.next") |>
    #   igop_get_links()
    # links <- rbind(links, linkspage)
  }
  remDr$close()
  return(links)

}

However, when I run

t3 <- igop_get_pages()

one of the following three things happens, without my changing any code. Either it crashes and returns this error:

Selenium message:No active session with ID 87c316d8-ded8-41e7-94d7-4a119e4006c1

Error:   Summary: NoSuchDriver
     Detail: A session is either terminated or not started
     Further Details: run errorDetails method

Or it crashes with this message:

Could not open firefox browser.
Client error message:
     Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     Further Details: run errorDetails method
Check server log for further details.
Error in checkError(res) : 
  Undefined error in httr call. httr output: length(url) == 1 is not TRUE

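Or it throws no error but cannot navigate beyond the second page: it reads the first page, clicks the "Next" button, reads the second page, then goes back to the first page and repeats the process. This should not happen, because the "Previous" button has a different CSS selector (predictably, a.prev). I have tried rvest::session_follow_link, but it does not work because the URL itself never changes (it is always https://igop.uab.cat/category/publicacions/# rather than https://igop.uab.cat/category/publicacions/2-3-whatever).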

I am using Firefox 118.0.2 on Windows.

r rvest rselenium
1 answer

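The content is updated through an Ajax call: a POST request is first sent to admin-ajax.php, which then returns the articles for the requested page number. You can find that call by inspecting the requests in the Network tab of your browser's developer tools, and you can mimic it yourself.

Rather than using RSelenium, I would suggest handling this with just rvest and httr2. You can copy the actual request from the browser's developer tools as cURL and pass it to httr2::curl_translate() to get translated httr2 code, which you can then adjust further, for example by testing whether all of those headers are actually required and whether the request parameters can be modified. Apparently we can increase posts_per_page, and if we also set to_page to 1, we can get all 60 articles with a single request. posts_per_page does not have to match the actual number of articles; we also tested with something like 100.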
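As a rough sketch of that workflow (not the exact request the site sends): the cURL string below is a shortened, hypothetical stand-in for what "Copy as cURL" would give you, and the three form fields are only a subset of those in the full request shown further down. curl_translate() only parses the string, and the hand-written request is built but never performed, so nothing here touches the network.

```r
library(httr2)

# Hypothetical, shortened stand-in for a command copied from the browser's
# dev tools; a real one carries many more headers and form fields.
curl_cmd <- paste(
  "curl 'https://igop.uab.cat/wp-admin/admin-ajax.php'",
  "-X POST",
  "--data-raw 'action=extra_blog_feed_get_content&to_page=1&posts_per_page=100'"
)

# Parse the command and return the equivalent httr2 code; no request is made.
translated <- curl_translate(curl_cmd)
print(translated)

# The same idea written by hand, with each parameter as a named form field,
# which makes to_page and posts_per_page easy to change.
req <- request("https://igop.uab.cat/wp-admin/admin-ajax.php") %>%
  req_body_form(
    action = "extra_blog_feed_get_content",
    to_page = "1",
    posts_per_page = "100"
  )

# Printing the request shows what would be sent, without sending it.
print(req)
```

Whether the server accepts such a reduced set of fields is something to check by trial and error; values such as blog_feed_nonce in the full request may well be tied to a particular page load.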

The following example extracts 3 links from each article: title, author, and comment count.

library(httr2)
library(rvest)
library(dplyr)
library(tidyr)
library(purrr)

# request the list of articles through WordPress admin-ajax.php;
# it is a POST call, so we'll use httr2;
# the call was extracted from the browser's dev tools as cURL, translated with
# httr2::curl_translate(), and a few parts were removed by trial and error;
# "to_page=1&posts_per_page=100" was modified to control the returned article collection
request("https://igop.uab.cat/wp-admin/admin-ajax.php") %>% 
  req_body_raw("action=extra_blog_feed_get_content&et_load_builder_modules=1&blog_feed_nonce=7e1f0a6567&to_page=1&posts_per_page=100&order=desc&orderby=date&categories=226&show_featured_image=1&blog_feed_module_type=masonry&et_column_type=&show_author=1&show_categories=1&show_date=1&show_rating=1&show_more=1&show_comments=1&date_format=M+j%2C+Y&content_length=excerpt&hover_overlay_icon=&use_tax_query=1&tax_query%5B0%5D%5Btaxonomy%5D=category&tax_query%5B0%5D%5Bterms%5D%5B%5D=publications-en&tax_query%5B0%5D%5Bfield%5D=slug&tax_query%5B0%5D%5Boperator%5D=IN&tax_query%5B0%5D%5Binclude_children%5D=true", "application/x-www-form-urlencoded; charset=UTF-8") %>% 
  req_perform() %>% 
  resp_body_html() %>% 
  # extract article elements; returns an xml_nodeset that we can process as a list
  html_elements("article") %>% 
  # extract title / author / comments elements from every article;
  # we'll have a list of named lists of html_nodes
  map(\(a) list(
    title = html_element(a, ".post-title.entry-title a"),
    author = html_element(a, ".vcard a[rel='author']"),
    comments = html_element(a, ".vcard a.comments-link")
    )) %>% 
  # apply a function to every html_node in our list (60 x 3) to extract href and text
  map_depth(2, \(a) list(url = html_attr(a, "href"),
                         text = html_text(a) %>% trimws())) %>% 
  # current item structure looks like this:
  # $ :List of 3
  #  ..$ title   :List of 2
  #  .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/"
  #  .. ..$ text: chr "El arte de pactar"
  #  ..$ author  :List of 2
  #  .. ..$ url : chr "https://igop.uab.cat/author/igop/"
  #  .. ..$ text: chr "IGOP"
  #  ..$ comments:List of 2
  #  .. ..$ url : chr "https://igop.uab.cat/2023/03/04/el-arte-de-pactar/#comments"
  #  .. ..$ text: chr "0"
  
  # rbind the list and convert to a tibble of 3 nested columns (title, author, comments);
  # each column includes url & text
  do.call(rbind, args = .) %>% as.data.frame() %>%
  as_tibble() %>% 
  # unnest to get 6 columns
  unnest_wider(everything(), names_sep = ".")

Result:

#> # A tibble: 60 × 6
#>    title.url        title.text author.url author.text comments.url comments.text
#>    <chr>            <chr>      <chr>      <chr>       <chr>        <chr>        
#>  1 https://igop.ua… El arte d… https://i… IGOP        https://igo… 0            
#>  2 https://igop.ua… Intersect… https://i… IGOP        https://igo… 0            
#>  3 https://igop.ua… EU agenci… https://i… IGOP        https://igo… 0            
#>  4 https://igop.ua… El apoyo … https://i… IGOP        https://igo… 0            
#>  5 https://igop.ua… The doubl… https://i… IGOP        https://igo… 0            
#>  6 https://igop.ua… Evaluatin… https://i… IGOP        https://igo… 0            
#>  7 https://igop.ua… Residenci… https://i… IGOP        https://igo… 0            
#>  8 https://igop.ua… Governmen… https://i… IGOP        https://igo… 0            
#>  9 https://igop.ua… Beyond re… https://i… IGOP        https://igo… 0            
#> 10 https://igop.ua… The emerg… https://i… IGOP        https://igo… 0            
#> # ℹ 50 more rows

Created on 2023-10-24 with reprex v2.0.2
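The question asked specifically for the "Read More" links, so here is a minimal sketch of that filter, reusing the text-matching approach from the question's igop_get_links(). The HTML is a hypothetical stand-in built with rvest::minimal_html(); in practice the page object would be the result of resp_body_html() from the request above, and the assumption that every article contains a link whose text is exactly "Read More" should be checked against the real response.

```r
library(rvest)

# Hypothetical HTML standing in for the admin-ajax.php response; the real
# markup is richer, but each article is assumed to contain a "Read More" link.
page <- minimal_html('
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/first/">First</a></h2>
    <a href="https://example.org/first/">Read More</a>
  </article>
  <article>
    <h2 class="post-title entry-title"><a href="https://example.org/second/">Second</a></h2>
    <a href="https://example.org/second/">Read More</a>
  </article>
')

# Collect every link inside an article, then keep only those labelled "Read More".
anchors <- html_elements(page, "article a")
links <- data.frame(
  text = trimws(html_text(anchors)),
  url = html_attr(anchors, "href")
)
read_more <- links[links$text == "Read More", ]
read_more
```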
