I'm having trouble scraping data from this website: https://scientific.sparx-ip.net/archiveeular/?c=s&view=2
I want to get the central table with the abstracts, but when I run
library(rvest)
page <- read_html("https://scientific.sparx-ip.net/archiveeular/?c=s&view=2")
page %>% html_table()
I only get a small, nearly empty table:
> page %>% html_table()
[[1]]
# A tibble: 4 x 2
X1 X2
<chr> <dbl>
1 "version:" 1.02
2 "" NA
3 "" NA
4 "" NA
It looks like I only get the left sidebar this way (page %>% html_text() also returns only the sidebar content).
I tried using sessions, but that did not help.
What am I doing wrong?
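In this case, the URL copied from the address bar does not fully determine the page you have open in the browser. Those list links (and other controls) generate 2 requests: the href
https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si
triggers a view update on the server side, returns an empty response and redirects to /archiveeular/?c=s&view=2, which in turn responds with the actual content. Opening /archiveeular/?c=s&view=2 directly in a fresh session serves a different page.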
So instead of read_html() you want something that can handle redirects, ideally automatically, and can emulate a browser session: rvest::session().
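To reach the table and increase the number of items shown per page, we first need to jump through a series of URLs while keeping the same session. The session() result is parsed exactly like a read_html() result: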
library(rvest)
library(dplyr, warn.conflicts = FALSE)
# run in session: open page, jump to "Abstract Titles", set page size to 100
s <- session("https://scientific.sparx-ip.net/archiveeular/index.cfm") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/calc.cfm?pagesize=100")
# extract page count
page_count <-
  s |>
  html_element("div.items-found > b:last-of-type") |>
  html_text() |>
  as.integer()
page_count
#> [1] 37
# extract table, handle duplicate column names and drop empty column
get_table <- function(s) {
  html_element(s, "table.table-result") |>
    html_table() |>
    setNames(c("item", "title", "blank")) |>
    select(item, title)
}
# navigate to a new page
jump_page <- function(s, n) {
  session_jump_to(s, paste0("https://scientific.sparx-ip.net/archiveeular/calc.cfm?page=", n))
}
# list allocation for tables
tables <- vector(mode = "list", length = page_count)
# table from current(1st) page
tables[[1]] <- get_table(s)
# collect tables from next 4 pages
for (n in 2:5) {
  message("Page ", n)
  s <- jump_page(s, n)
  tables[[n]] <- get_table(s)
}
#> Page 2
#> Page 3
#> Page 4
#> Page 5
# concat tables from the list
bind_rows(tables)
Result:
#> # A tibble: 500 × 2
#> item title
#> <chr> <chr>
#> 1 2023 POS0772 ´´EPIDEMIOLOGY OF JUVENILE IDIOPATHIC ARTHRITIS IN ARGENTI…
#> 2 2023 POS0147 αVβ3 INTEGRIN AS A LINKER BETWEEN FIBROSIS AND THYROID HOR…
#> 3 2023 OP0275-HPR ‘IT’S A LOT TO TAKE IN’: A SYSTEMATIC REVIEW OF THE INFORM…
#> 4 2023 POS0175 “DO DISEASE MODIFYING ANTIRHEUMATIC DRUGS INFLUENCE THE FR…
#> 5 2023 AB1732-PARE “FLARE, DID YOU SAY FLARE?” FLARES IN SJÖGREN’S DISEASE: T…
#> 6 2023 POS1447 “IF I HAVE SJÖGREN’S SYNDROME, I WANT TO KNOW IT AS EARLY …
#> 7 2023 AB0201 “IT SURPRISED ME A LOT THAT THERE IS A LINK”: A QUALITATIV…
#> 8 2023 POS0788-HPR “IT’S LIKE LISTENING TO THE RADIO WITH A LITTLE INTERFEREN…
#> 9 2023 POS0201-PARE “MOOD IS HAPPY AND DOWNRIGHT WILD” - HEALTH PROMOTION AND …
#> 10 2023 POS1585-HPR “SO, MEN WILL BE ABLE TO RECEIVE #METHOTREXATE FOR LUPUS A…
#> # ℹ 490 more rows
Created on 2024-01-17 with reprex v2.0.2
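The reprex above stops after 5 pages to keep the example short. If you need every page, the same loop can be wrapped in a small helper. This is only a sketch: collect_pages() is a hypothetical name (it is not an rvest function), its extract and jump arguments are meant to receive get_table() and jump_page() from above, and the delay argument is an assumed courtesy pause between requests, not something the site is known to require.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper (not part of rvest): walk pages 1..page_count of a
# paginated session and row-bind the table extracted from each page.
# `extract` and `jump` stand in for get_table() and jump_page() defined above;
# `delay` is an assumed pause (in seconds) between requests.
collect_pages <- function(s, page_count, extract, jump, delay = 1) {
  tables <- vector(mode = "list", length = page_count)
  # the session is already on the first page
  tables[[1]] <- extract(s)
  # pages 2..page_count; the sequence is empty when there is only one page
  for (n in seq_len(page_count)[-1]) {
    Sys.sleep(delay)
    s <- jump(s, n)
    tables[[n]] <- extract(s)
  }
  bind_rows(tables)
}
```

With the objects from the reprex, the call would be collect_pages(s, page_count, get_table, jump_page). With a page count of 37 that means 36 further requests, so expect it to take a while.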