Web scraping with RSelenium


I am trying to web-scrape a website to retrieve the URLs of all hospitals in each state. For example, https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama and https://guide.prod.iam.aha.org/guide/searchResults?query=Alaska use JavaScript to populate the page dynamically, changing both the content and the CSS styling. The search results for a state may span one page or several. When the results run to more than one page, a pagination div is added via JavaScript, as shown below

The results can be browsed with the pagination div at the bottom of the page, or viewed on a single page. When there is only one page, however, the pagination div is not rendered at all. When it is rendered, the anchor tags with the class name "nav-link" can be clicked through to reach the last page of results, and on the last page the li tags with the class names "pagination-last" and "pagination-next" gain an additional class name, "disabled".

Given that, I want to loop through the results to collect and combine the URLs of every hospital in each state; each URL is attached to the hospital's "Profile" button, as shown below

My code below works for states whose results span multiple pages, but not for single-page results. I need it to work in both cases, whether or not the pagination div is present.

library(tidyverse)
library(rvest)
library(RSelenium)
library(wdman)
library(netstat)

selenium_object <- selenium(retcommand = TRUE, check = FALSE)

remote_driver <- rsDriver(browser = 'chrome',
                          chromever = "116.0.5845.98",
                          verbose = FALSE,
                          port = free_port())

remDr <- remote_driver$client

collect_hospital_urls <- function(state_url){
  
  remDr$navigate(state_url)
  
  preferred_class <- "disabled"
  all_profile_urls <- list()
  while (TRUE) {
    # collect all hospital profile links on the current page
    profile_tags <- remDr$findElements(using = "css", value = "a[_ngcontent-c10]")
    
    # Extract href attributes
    profile_href_attributes <- sapply(profile_tags, function(tag) {
      tag$getElementAttribute("href")[[1]]
    })
    
    # combine to all page profiles
    all_profile_urls <- append(all_profile_urls, profile_href_attributes)
    
    # Find the pagination li element by its class name
    li_pagination_next <- remDr$findElement(using = "class name", "pagination-next")
    # Check whether the "disabled" class is present, i.e. we are on the last page
    li_classes <- str_split(unlist(li_pagination_next$getElementAttribute("class")), "\\s+", simplify = TRUE)
    if (preferred_class %in% as.character(li_classes)) {
      print("Preferred class found!")
      break
    } else {
      # Click on the link to potentially load new content
      next_button <- remDr$findElement(using = 'link text', 'Next')
      next_button$clickElement()
      print("Oj")
      # Wait for some time to allow the new content to load
      Sys.sleep(2)  
    }
  }
  
  all_profile_urls |> unlist()
}

x <- collect_hospital_urls(state_url)

Looking forward to your help.

I tried looping through the results with a while loop, but it does not work when there is only one page and the pagination div is not rendered.
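For reference, I suspect the guard needs to check whether the pagination element exists at all before calling findElement(), since findElement() throws an error when nothing matches while findElements() returns an empty list. A rough sketch of the kind of check I mean (untested):

# findElements() returns an empty list when nothing matches, so the
# length check avoids the error findElement() throws on one-page results
pagination_next <- remDr$findElements(using = "class name", "pagination-next")
if (length(pagination_next) == 0) {
  break  # no pagination div, so this is the only page of results
}
li_pagination_next <- pagination_next[[1]]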

1 Answer

If RSelenium is not a strict requirement, I would use the site's GraphQL API and JSON responses instead. The request payload can be found through the browser's developer tools. The super-lazy approach (copying the request as cURL and passing it through httr2::curl_translate()) is a bit of a challenge here, because curl_translate() struggles with the escape sequences in the JSON string. For quick prototyping, though, we can still get a working {httr} request through https://curlconverter.com/r/, just to see whether anything blocks a no-JS approach (e.g. Cloudflare anti-scraping measures). In this case it works just fine, and we can then test whether it keeps working without passing cookies and extra headers.
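For illustration, the translation step looks roughly like this; the cURL command below is a simplified stand-in for the real copied request, whose escaped JSON payload is what makes the translation fiddly:

library(httr2)
# simplified stand-in for a "Copy as cURL" command from DevTools;
# curl_translate() prints the equivalent httr2 request pipeline
curl_translate(
  "curl 'https://guide.prod.iam.aha.org/guide/graphql' -H 'Content-Type: application/json' --data-raw '{\"operationName\":\"Search\"}'"
)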

library(dplyr, warn.conflicts = FALSE)
library(httr2)
library(purrr)

# graphql query extracted from POST request payload
graphql_payload <- jsonlite::parse_json(
  '{
    "query": "query Search($searchTerm: String, $searchType: String, $state: String, $cities: [String], $counties: [String], $pageNum: Int!) {\\n  search(searchTerm: $searchTerm, searchType: $searchType, state: $state, city: $cities, county: $counties, pageNum: $pageNum) {\\n    total\\n    start\\n    pageLength\\n    result {\\n      index\\n      resultType\\n      orgDisplayName\\n      systemDisplayName\\n      region\\n      personDisplayName\\n      title\\n      address\\n      ahaId\\n      ahaMember\\n      affiliateUnitOf\\n      __typename\\n    }\\n    facet {\\n      name\\n      facetValue\\n      __typename\\n    }\\n    error\\n    __typename\\n  }\\n}\\n",
    "variables": {
      "searchTerm": "Alabama",
      "searchType": "all",
      "pageNum": 1
    },
    "operationName": "Search"
  }')  

# set search term and page number in graphql query, make request
graphql_search <- function(graphql, term, page = 1){
  graphql$variables$searchTerm <- term
  graphql$variables$pageNum <- page
  
  request("https://guide.prod.iam.aha.org/guide/graphql") %>% 
    req_body_json(graphql) %>% 
    req_perform() %>% 
    resp_body_json()
}
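As a quick sanity check before wiring up pagination, a single request can be inspected on its own; total and pageLength in the response drive the page arithmetic below:

# fetch one page of results and peek at the pagination fields
first_page <- graphql_search(graphql_payload, "Alabama", 1)
first_page$data$search$total       # total number of hits
first_page$data$search$pageLength  # results per page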

# execute the hospital search, calculate the last page number from the first
# response and, if there are more pages, fetch those as well
hosp_search <- function(term, graphql = graphql_payload){
  results <- graphql_search(graphql, term, 1) %>% pluck("data", "search") %>% list()
  last_page_n <- ceiling(as.numeric(results[[1]]$total) / as.numeric(results[[1]]$pageLength))
  if (last_page_n > 1){
    results_cont <- map(2:last_page_n, \(page) graphql_search(graphql, term, page) %>% pluck("data", "search"))
    results <- c(results, results_cont)
  }
  results
}

# execute search,
# pluck "result" elements from returned list (each returned page is a list item),
# convert resulting list of named lists to a data.frame / tibble with bind_rows and
# generate urls from ahaId field
hosp_search("Alaska") %>% 
  map("result") %>% 
  bind_rows() %>% 
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)) %>% 
  select(orgDisplayName, url)

Result:

#> # A tibble: 4 × 2
#>   orgDisplayName                   url                                          
#>   <chr>                            <chr>                                        
#> 1 Alaska Regional Hospital         https://guide.prod.iam.aha.org/guide/hospita…
#> 2 Alaska Native Medical Center     https://guide.prod.iam.aha.org/guide/hospita…
#> 3 Alaska Psychiatric Institute     https://guide.prod.iam.aha.org/guide/hospita…
#> 4 Providence Alaska Medical Center https://guide.prod.iam.aha.org/guide/hospita…

Created on 2023-08-29 with reprex v2.0.2
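And since the goal was all hospitals across states, the same helper can be mapped over a vector of state names. A sketch using base R's built-in state.name vector, with a short pause between states to stay polite to the API:

# loop over all 50 states (state.name ships with base R), pausing
# briefly between searches, then combine everything into one tibble
all_states <- map(state.name, \(s) {
  Sys.sleep(1)
  hosp_search(s) %>% 
    map("result") %>% 
    bind_rows() %>% 
    mutate(state = s)
}) %>% 
  bind_rows() %>% 
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId))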
