I'm trying to web scrape a website to retrieve the URLs of all hospitals in different states. For example, https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama or https://guide.prod.iam.aha.org/guide/searchResults?query=Alaska use JavaScript to dynamically populate the page, changing its content and CSS styling. The search results shown for a state may span one or several pages. If the results exceed one page, a navigation div is added via JavaScript, like this:
The results can either be navigated with the pagination div at the bottom of the page or viewed on a single page. However, when there is only one page, the navigation div is not displayed at all. When the navigation div is displayed, you can click through the anchor tags with class name "nav-link" to reach the last results page. Once the last results page is shown, the li tags with class name "navigation-last" or "navigation-next" also gain an additional class name, "disabled".
Given this, I want to loop through the results to collect and combine the URLs of every hospital in each state. These URLs are attached to each hospital's "Profile" button, like this:
My code below works for states whose results span multiple pages, but not for single-page results. I need it to handle both cases, whether or not the pagination div is present.
library(tidyverse)
library(rvest)
library(RSelenium)
library(wdman)
library(netstat)

selenium_object <- selenium(retcommand = T, check = F)
remote_driver <- rsDriver(browser = 'chrome',
                          chromever = "116.0.5845.98",
                          verbose = F,
                          port = free_port())
remDr <- remote_driver$client
collect_hospital_urls <- function(state_url) {
  remDr$navigate(state_url)
  preferred_class <- "disabled"
  all_profile_urls <- list()

  while (TRUE) {
    # collect all hospital profile links on the current page
    profile_tags <- remDr$findElements(using = "css", value = "a[_ngcontent-c10]")
    # extract href attributes
    profile_href_attributes <- sapply(profile_tags, function(tag) {
      tag$getElementAttribute("href")[[1]]
    })
    # combine with the profiles collected from previous pages
    all_profile_urls <- append(all_profile_urls, profile_href_attributes)

    # find the li HTML element by its class name
    li_pagination_next <- remDr$findElement(using = "class name", "pagination-next")
    # check if the preferred class name is present
    if (preferred_class %in% as.character(str_split(unlist(li_pagination_next$getElementAttribute("class")), "\\s+", simplify = T))) {
      print("Preferred class found!")
      break
    } else {
      # click on the link to potentially load new content
      next_button <- remDr$findElement(using = 'link text', 'Next')
      next_button$clickElement()
      print("Oj")
      # wait for some time to allow the new content to load
      Sys.sleep(2)
    }
  }

  all_profile_urls <- all_profile_urls |> unlist()
}

x <- collect_hospital_urls(state_url)
Looking forward to your help.
I tried looping through the results with a while loop, but it doesn't work when there is only one page and the navigation div is not displayed.
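What I suspect is missing is a guard for the case where the navigation div never appears. A rough idea (untested sketch; as far as I know `findElements()` returns an empty list when nothing matches, whereas `findElement()` throws an error):

```r
# Untested sketch: findElements() returns a list (possibly empty), so its
# length tells us whether the pagination div was rendered at all.
has_pagination <- function(driver) {
  nodes <- driver$findElements(using = "class name", value = "pagination-next")
  length(nodes) > 0
}
```

Inside the while loop, this check could run before the unconditional `findElement()` call, breaking out immediately when it returns `FALSE`.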
If RSelenium is not a strict requirement, I'd go with the GraphQL API and JSON responses instead. The request payload can be found through the browser's developer tools. The super-lazy approach (e.g. copying the request as cURL and passing it through `httr2::curl_translate()`) is a bit challenging here, as the latter struggles with escape sequences in JSON strings, but for rapid prototyping we can still get a working `{httr}` request through https://curlconverter.com/r/ just to see if anything would block a no-js approach (e.g. Cloudflare anti-scraping measures). In this case it works just fine, and we can then test whether it keeps working without passing cookies and extra headers.
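For illustration, the "copy as cURL, then translate" route mentioned above looks like this (hypothetical endpoint; real GraphQL payloads with escaped JSON in `-d` are exactly where `curl_translate()` tends to trip up):

```r
library(httr2)

# Paste a "Copy as cURL" command captured in the browser's dev tools;
# curl_translate() prints the equivalent httr2 pipeline code.
curl_translate('curl "https://example.com/api" -H "Accept: application/json"')
```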
library(dplyr, warn.conflicts = FALSE)
library(httr2)
library(purrr)
# graphql query extracted from POST request payload
graphql_payload <- jsonlite::parse_json(
'{
"query": "query Search($searchTerm: String, $searchType: String, $state: String, $cities: [String], $counties: [String], $pageNum: Int!) {\\n search(searchTerm: $searchTerm, searchType: $searchType, state: $state, city: $cities, county: $counties, pageNum: $pageNum) {\\n total\\n start\\n pageLength\\n result {\\n index\\n resultType\\n orgDisplayName\\n systemDisplayName\\n region\\n personDisplayName\\n title\\n address\\n ahaId\\n ahaMember\\n affiliateUnitOf\\n __typename\\n }\\n facet {\\n name\\n facetValue\\n __typename\\n }\\n error\\n __typename\\n }\\n}\\n",
"variables": {
"searchTerm": "Alabama",
"searchType": "all",
"pageNum": 1
},
"operationName": "Search"
}')
# set search term and page number in graphql query, make request
graphql_search <- function(graphql, term, page = 1) {
  graphql$variables$searchTerm <- term
  graphql$variables$pageNum <- page
  request("https://guide.prod.iam.aha.org/guide/graphql") %>%
    req_body_json(graphql) %>%
    req_perform() %>%
    resp_body_json()
}
# execute hospital search, calculate last page number from the first response;
# if there are more pages, fetch those as well
hosp_search <- function(term, graphql = graphql_payload) {
  results <- graphql_search(graphql, term, 1) %>% pluck("data", "search") %>% list()
  last_page_n <- ceiling(as.numeric(results[[1]]$total) / as.numeric(results[[1]]$pageLength))
  if (last_page_n > 1) {
    results_cont <- map(2:last_page_n, \(page) graphql_search(graphql, term, page) %>% pluck("data", "search"))
    results <- c(results, results_cont)
  }
  results
}
# execute search,
# pluck "result" elements from the returned list (each returned page is a list item),
# convert the resulting list of named lists to a data.frame / tibble with bind_rows and
# generate urls from the ahaId field
hosp_search("Alaska") %>%
  map("result") %>%
  bind_rows() %>%
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)) %>%
  select(orgDisplayName, url)
Result:
#> # A tibble: 4 × 2
#>   orgDisplayName                   url
#>   <chr>                            <chr>
#> 1 Alaska Regional Hospital         https://guide.prod.iam.aha.org/guide/hospita…
#> 2 Alaska Native Medical Center     https://guide.prod.iam.aha.org/guide/hospita…
#> 3 Alaska Psychiatric Institute     https://guide.prod.iam.aha.org/guide/hospita…
#> 4 Providence Alaska Medical Center https://guide.prod.iam.aha.org/guide/hospita…
Created on 2023-08-29 with reprex v2.0.2
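As an aside, the last-page calculation in `hosp_search()` above is plain ceiling division over the response's `total` and `pageLength` fields; a minimal standalone illustration (the page length of 25 here is a made-up example value):

```r
# Number of result pages needed for `total` results at `page_length` per page
pages_needed <- function(total, page_length) {
  ceiling(as.numeric(total) / as.numeric(page_length))
}

pages_needed(4, 25)   # Alaska's 4 results fit on a single page -> 1
pages_needed(120, 25) # 120 results -> 5 pages
```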