Getting the central table from a web page


I'm having trouble scraping data from this site: https://scientific.sparx-ip.net/archiveeular/?c=s&view=2

I want to get the central table with the abstracts, but if I do this:


library(rvest)
page <- read_html("https://scientific.sparx-ip.net/archiveeular/?c=s&view=2") 
page %>% html_table()

I only get one small, mostly empty table.

> page %>% html_table()
[[1]]
# A tibble: 4 x 2
  X1            X2
  <chr>      <dbl>
1 "version:"  1.02
2 ""         NA   
3 ""         NA   
4 ""         NA   

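It looks like this only gives me the left sidebar (page %>% html_text() also returns only the left sidebar content).

I tried using sessions, but that did not help.

What am I doing wrong?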

r web-scraping rvest
1 Answer

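In this case, the URL copied from the address bar does not define the page you have open in the browser. Those listing links (and other controls) generate 2 requests: the href https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si triggers a view update on the server side, but returns an empty response and redirects to /archiveeular/?c=s&view=2, which in turn responds with the actual content. Accessing /archiveeular/?c=s&view=2 directly in a fresh session serves a different page.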

So instead of read_html() you want something that can handle redirects, preferably automatically, and can emulate a browser session: rvest::session().
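To see the difference between a fresh request and a session that followed the listing link, a rough check could look like the sketch below. It reuses the table.table-result selector from the code further down in this answer; has_result_table() and compare_fresh_vs_session() are hypothetical helper names, and the second one needs network access, so it is only defined here and not run.

```r
library(rvest)

# Selector taken from this answer's code; whether it matches depends on
# which page the server decides to send.
has_result_table <- function(doc) {
  length(html_elements(doc, "table.table-result")) > 0
}

# Hypothetical helper (needs network access, so it is only defined here).
# Compares a fresh read_html() with a session that visited the listing link.
compare_fresh_vs_session <- function() {
  base <- "https://scientific.sparx-ip.net/archiveeular/"
  fresh <- read_html(paste0(base, "?c=s&view=2"))
  s <- session(paste0(base, "index.cfm")) |>
    session_jump_to(paste0(base, "index.cfm?view=2&c=si"))
  c(fresh = has_result_table(fresh), session = has_result_table(s))
}

# Offline check of the helper on hand-written HTML
has_result_table(minimal_html("<table class='table-result'><tr><td>x</td></tr></table>"))
#> [1] TRUE
has_result_table(minimal_html("<table><tr><td>x</td></tr></table>"))
#> [1] FALSE
```

If the explanation above holds, compare_fresh_vs_session() should report the table as missing for the fresh request and present for the session, but that depends on the site's current behaviour.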

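To reach the table and increase the number of items shown per page, we first need to jump through a series of URLs while keeping the session alive. A session() result can be parsed just like a read_html() result: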

library(rvest)
library(dplyr, warn.conflicts = FALSE)

# run in session: open page, jump to "Abstract Titles", set page size to 100
s <- session("https://scientific.sparx-ip.net/archiveeular/index.cfm") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/index.cfm?view=2&c=si") |>
  session_jump_to("https://scientific.sparx-ip.net/archiveeular/calc.cfm?pagesize=100")

# extract page count 
page_count <- 
  s |>
  html_element("div.items-found > b:last-of-type") |> 
  html_text() |> 
  as.integer()
page_count
#> [1] 37

# extract table, handle duplicate column names and drop empty column
get_table <- function(s) {
  html_element(s, "table.table-result") |> 
    html_table() |>
    setNames(c("item", "title", "blank")) |>
    select(item, title)
}
# navigate to a new page
jump_page <- function(s, n){
  session_jump_to(s, paste0("https://scientific.sparx-ip.net/archiveeular/calc.cfm?page=",n))
}

# list allocation for tables
tables <- vector(mode = "list", length = page_count)

# table from current(1st) page
tables[[1]] <- get_table(s)

# collect tables from next 4 pages
for (n in 2:5){
  message("Page ", n)
  s <- jump_page(s, n)
  tables[[n]] <- get_table(s)
}
#> Page 2
#> Page 3
#> Page 4
#> Page 5

# concat tables from the list
bind_rows(tables)

Result:

#> # A tibble: 500 × 2
#>    item              title                                                      
#>    <chr>             <chr>                                                      
#>  1 2023 POS0772      ´´EPIDEMIOLOGY OF JUVENILE IDIOPATHIC ARTHRITIS IN ARGENTI…
#>  2 2023 POS0147      αVβ3 INTEGRIN AS A LINKER BETWEEN FIBROSIS AND THYROID HOR…
#>  3 2023 OP0275-HPR   ‘IT’S A LOT TO TAKE IN’: A SYSTEMATIC REVIEW OF THE INFORM…
#>  4 2023 POS0175      “DO DISEASE MODIFYING ANTIRHEUMATIC DRUGS INFLUENCE THE FR…
#>  5 2023 AB1732-PARE  “FLARE, DID YOU SAY FLARE?” FLARES IN SJÖGREN’S DISEASE: T…
#>  6 2023 POS1447      “IF I HAVE SJÖGREN’S SYNDROME, I WANT TO KNOW IT AS EARLY …
#>  7 2023 AB0201       “IT SURPRISED ME A LOT THAT THERE IS A LINK”: A QUALITATIV…
#>  8 2023 POS0788-HPR  “IT’S LIKE LISTENING TO THE RADIO WITH A LITTLE INTERFEREN…
#>  9 2023 POS0201-PARE “MOOD IS HAPPY AND DOWNRIGHT WILD” - HEALTH PROMOTION AND …
#> 10 2023 POS1585-HPR  “SO, MEN WILL BE ABLE TO RECEIVE #METHOTREXATE FOR LUPUS A…
#> # ℹ 490 more rows

Created on 2024-01-17 with reprex v2.0.2
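The example above stops at page 5. One way to walk all pages is sketched below, assuming the same jump_page() and get_table() helpers. collect_pages() and its pause argument are hypothetical additions, not part of rvest; the pause is there only to avoid hammering the server. The jump and extract functions are passed in as arguments so the loop can be tried offline with stand-ins.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: visit pages 1..page_count while reusing the session.
collect_pages <- function(s, page_count, jump, extract, pause = 1) {
  tables <- vector(mode = "list", length = page_count)
  tables[[1]] <- extract(s)             # the session already shows page 1
  for (n in seq_len(page_count)[-1]) {  # empty loop when there is one page
    Sys.sleep(pause)                    # short delay between requests
    s <- jump(s, n)
    tables[[n]] <- extract(s)
  }
  bind_rows(tables)
}

# Offline stand-ins so the loop can be checked without network access:
# the fake "session" is just the page number.
fake_jump <- function(s, n) n
fake_extract <- function(s) {
  tibble(item = paste("item", s), title = paste("title", s))
}

demo <- collect_pages(1, page_count = 3, jump = fake_jump,
                      extract = fake_extract, pause = 0)
nrow(demo)
#> [1] 3

# With the objects from the answer this would become (not run here):
# all_tables <- collect_pages(s, page_count, jump_page, get_table)
```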
