Scraping data from a complex website in R (rvest package)


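For a research project, I want to extract all course information from every programme of every faculty. I have practised scraping data from websites quite a lot, but my university's website seems to be coded differently from the sites I practised on.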

This is the link to the website: https://ocasys.rug.nl/current/catalog

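My goal is to write a script that extracts all programme names together with the corresponding programme hyperlinks. A similar script could then extract, for each programme, all course names with their corresponding course codes. Finally, these course codes can be used to generate the URL of each individual course page, from which the relevant data can be scraped.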

[This is where the information that I need is located, stored in this element: name + hyperlink. This information is stored per programme in an li element, if I understand correctly. This prompted me to try to extract all of these li elements in order to extract all the a elements.]

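I have tried many ways to extract this, but so far without success. To make sure I was not doing anything wrong, I also used the same script to extract data from other websites, which was not a problem. It seems that I am unable to identify the element to extract.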

This is the script I used. I hope someone with more experience than me can determine what I should pass to the html_elements() function, because .css-105mfs8 does not seem to work.

# 1. Set up packages
library(pacman)
p_load('rvest', 'rlang')

# 2. Load the website course data
URL <- 'https://ocasys.rug.nl/current/catalog'
document <- read_html(URL)

# Select the li element of each programme
html_programmes <- document %>% html_elements('li.css-105mfs8')

# The a element inside each li holds the name and the hyperlink
a_element <- html_programmes %>% html_element("a")

programme_urls <- a_element %>% html_attr("href")
programme_names <- a_element %>% html_text2()

programmes <- data.frame(
  programme_names,
  programme_urls
)
html r web-scraping rvest
1 Answer

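The page and the DOM tree that you see in your browser's element inspector have been modified by JavaScript, which means that the content rvest loads is quite different and those selectors will not work there. You can turn off JavaScript for that particular page in your browser to get an idea of what is still available to rvest.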
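To illustrate the point, here is a minimal sketch that runs without network access. The HTML string is a made-up stand-in for a JavaScript-rendered page, not the actual markup of the catalog site: the server delivers an empty container and leaves it to a script to build the list.

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-rendered page: an empty container
# plus a script that a browser would run to fill in the list items.
static_page <- minimal_html('
  <div id="root"></div>
  <script>/* a browser would build the list items here */</script>
')

# rvest parses only the markup it received and never runs the script,
# so a selector for the list items matches nothing.
items <- html_elements(static_page, "li")
length(items)
#> [1] 0
```

An empty result like this, for a selector that clearly matches in the browser inspector, is a common hint that the content is rendered client-side.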

However, the whole catalog is available as a single JSON document, https://ocasys.rug.nl/api/faculty/catalog/2022-2023, and it can be converted to tabular form. A few notes on what is going on here:

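- hoist() extracts some specific elements from a nested list,
- unnest_longer() keeps the number of columns but adds rows (i.e. we start with a list of 14 faculties, and after unnest_longer() we have a list of 434 programmes),
- map_chr() is used to convert lists of 1..2 items into strings (some programmes have 2 languages, most have 1).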
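The three functions described above can be tried on a small nested list before pointing them at the real API. The data below is invented for illustration (two faculties, three programmes); only the field names titleEN, programs, code and language mirror the ones used in this answer.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data, made up for illustration: two faculties with nested programmes.
catalog <- list(
  list(titleEN = "Faculty A", programs = list(
    list(code = "P1", language = list("ENGLISH")),
    list(code = "P2", language = list("DUTCH", "ENGLISH"))
  )),
  list(titleEN = "Faculty B", programs = list(
    list(code = "P3", language = list("DUTCH"))
  ))
)

toy <- tibble(faculty = catalog) %>%
  hoist(faculty, "titleEN", "programs") %>%  # pull two fields out of each faculty
  unnest_longer(programs) %>%                # one row per programme: 2 rows become 3
  hoist(programs, "code", "language") %>%    # pull fields out of each programme
  # collapse each list of 1..2 languages into a single string
  mutate(language = map_chr(language, \(x) paste0(x, collapse = ", "))) %>%
  select(titleEN, code, language)
```

The resulting toy tibble has one row per programme, and the programme with two languages ends up with the string "DUTCH, ENGLISH" in its language column.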

More on rectangling: https://tidyr.tidyverse.org/articles/rectangle.html

library(jsonlite)
library(dplyr)
library(tidyr)
library(purrr)

faculty <- fromJSON("https://ocasys.rug.nl/api/faculty/catalog/2022-2023", simplifyVector = FALSE) %>% 
  tibble(faculty = .) %>% 
  hoist(faculty, "titleEN", "programs") %>% 
  unnest_longer(programs) %>% 
  hoist(programs, "language", "levels", programTitle = "titleEN", "code") %>% 
  mutate(language = map_chr(language, \(x) paste0(x, collapse = ", "))) %>% 
  mutate(levels = map_chr(levels, \(x) paste0(x, collapse = ", "))) %>% 
  mutate(url = paste0("https://ocasys.rug.nl/current/catalog/programme/",code)) %>% 
  select(-c(programs, faculty))

Result:

faculty
#> # A tibble: 434 × 6
#>    titleEN                 language levels   programTitle            code  url  
#>    <chr>                   <chr>    <chr>    <chr>                   <chr> <chr>
#>  1 Science and Engineering ENGLISH  BACHELOR "BSc Astronomy"         50205 http…
#>  2 Science and Engineering ENGLISH  MASTER   "MSc Applied Physics "  60436 http…
#>  3 Science and Engineering ENGLISH  BACHELOR "BSc Computing Science" 56978 http…
#>  4 Science and Engineering ENGLISH  MASTER   "MSc Biomedical Engine… 6622… http…
#>  5 Science and Engineering ENGLISH  MASTER   "MSc Energy and Enviro… 60608 http…
#>  6 Science and Engineering ENGLISH  EXCHANGE "BSc Courses for Excha… WBEX… http…
#>  7 Science and Engineering ENGLISH  MASTER   "MSc Physics: Quantum … 6020… http…
#>  8 Science and Engineering ENGLISH  BACHELOR "BSc Pharmacy "         56157 http…
#>  9 Science and Engineering ENGLISH  BACHELOR "Pre-master/Fast-track… Pre-… http…
#> 10 Science and Engineering DUTCH    EXCHANGE "BSc Courses for Excha… WBEX… http…
#> # … with 424 more rows
glimpse(faculty)
#> Rows: 434
#> Columns: 6
#> $ titleEN      <chr> "Science and Engineering", "Science and Engineering", "Sc…
#> $ language     <chr> "ENGLISH", "ENGLISH", "ENGLISH", "ENGLISH", "ENGLISH", "E…
#> $ levels       <chr> "BACHELOR", "MASTER", "BACHELOR", "MASTER", "MASTER", "EX…
#> $ programTitle <chr> "BSc Astronomy", "MSc Applied Physics ", "BSc Computing S…
#> $ code         <chr> "50205", "60436", "56978", "66226-5506", "60608", "WBEXFA…
#> $ url          <chr> "https://ocasys.rug.nl/current/catalog/programme/50205", …

Or something like this: let simplifyVector in fromJSON() do its job (it is enabled by default), unnest (almost) everything, and hope for the best:

fromJSON("https://ocasys.rug.nl/api/faculty/catalog/2022-2023") %>% 
  as_tibble() %>% 
  unnest(programs, names_sep = ".") %>% 
  unnest(status, names_sep = ".") 
#> # A tibble: 434 × 13
#>    id      code  titleEN titleNL progr…¹ progr…² progr…³ progr…⁴ progr…⁵ progr…⁶
#>    <chr>   <chr> <chr>   <chr>   <chr>   <list>  <list>  <chr>   <chr>   <chr>  
#>  1 2591c6… fwn   Scienc… Scienc… 2a2243… <chr>   <chr>   "BSc A… "BSc S… 50205  
#>  2 2591c6… fwn   Scienc… Scienc… 426809… <chr>   <chr>   "MSc A… "MSc A… 60436  
#>  3 2591c6… fwn   Scienc… Scienc… 785b8f… <chr>   <chr>   "BSc C… "BSc C… 56978  
#>  4 2591c6… fwn   Scienc… Scienc… 813944… <chr>   <chr>   "MSc B… "MSc B… 66226-…
#>  5 2591c6… fwn   Scienc… Scienc… 3a1b3d… <chr>   <chr>   "MSc E… "MSc E… 60608  
#>  6 2591c6… fwn   Scienc… Scienc… eda736… <chr>   <chr>   "BSc C… "BSc C… WBEXFA 
#>  7 2591c6… fwn   Scienc… Scienc… f2b8f8… <chr>   <chr>   "MSc P… "MSc P… 60202-…
#>  8 2591c6… fwn   Scienc… Scienc… 030967… <chr>   <chr>   "BSc P… "BSc F… 56157  
#>  9 2591c6… fwn   Scienc… Scienc… a786ea… <chr>   <chr>   "Pre-m… "Pre-m… Pre-Fa…
#> 10 2591c6… fwn   Scienc… Scienc… 8fa491… <chr>   <chr>   "BSc C… "BSc C… WBEXIE 
#> # … with 424 more rows, 3 more variables: status.year <chr>,
#> #   status.visible <lgl>, status.export <lgl>, and abbreviated variable names
#> #   ¹​programs.id, ²​programs.language, ³​programs.levels, ⁴​programs.titleEN,
#> #   ⁵​programs.titleNL, ⁶​programs.code
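What simplifyVector does can be seen on a small inline JSON string, without calling the API. The JSON below is invented for illustration and is much simpler than the real catalog: with the default settings, fromJSON() turns the array of faculties into a data frame whose programs column holds one data frame per faculty, which unnest() then expands.

```r
library(jsonlite)
library(dplyr)
library(tidyr)

# Made-up JSON with the same general shape: faculties containing programmes.
json <- '[
  {"code": "fa", "programs": [{"code": "P1", "titleEN": "BSc One"},
                              {"code": "P2", "titleEN": "MSc Two"}]},
  {"code": "fb", "programs": [{"code": "P3", "titleEN": "MSc Three"}]}
]'

# simplifyVector = TRUE (the default) returns a data frame;
# the nested arrays of objects become a list column of data frames.
simplified <- fromJSON(json)

flat <- simplified %>%
  as_tibble() %>%
  unnest(programs, names_sep = ".")  # prefix nested columns with "programs."

names(flat)
#> [1] "code"             "programs.code"    "programs.titleEN"
```

The names_sep argument matters here because the faculty and its programmes both have a code field; without a prefix the two columns would clash.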

Created on 2023-03-21 with reprex v2.0.2
