Web scraping with rvest in R

Question

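I want to extract the impact rankings table in tidy form, including the rank, university name, SDG goal 1, SDG goal 2, SDG goal 3, SDG goal 4, the overall score, and the country/region. Below is the code I have tried.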

library(tidyverse)
library(rvest)

link <- "https://www.timeshighereducation.com/impactrankings"
selector <- "#block-system-main > div > div.container"
webpage <- read_html(link)
data <- html_nodes(webpage, selector)
content <- html_text(data)

content

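The code above does not do what I need. I am trying to extract the data from the university impact rankings table on the web page and organize it into a tidy format.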

Answer 1

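The usual reminder for rvest questions applies: before using rvest, try blocking JavaScript in your browser, reload the page, and see whether the information you want still appears. If it does not, the information is most likely loaded with JavaScript, so rvest is not the tool for the job.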
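The same check can be approximated from R by asking whether the static HTML contains any table at all. This is only a minimal sketch: the HTML string below is a made-up stand-in for a page whose rows are filled in by JavaScript, not the real page source.

```r
library(rvest)

# Hypothetical stand-in for a JavaScript-driven page: the container exists,
# but the static HTML holds no <table> element.
static_page <- minimal_html(
  '<div id="datatable"></div><script>/* rows are added here at runtime */</script>'
)

# Look for tables in the HTML that rvest actually receives
tables <- html_elements(static_page, "table")
length(tables)

# For a real page, the same check would be:
# read_html(link) |> html_elements("table") |> length()
```

If the count is zero for the real page while the table is visible in the browser, that suggests the data arrives through a separate request.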

The fallback options are:

hope you get lucky and find a JSON file in the Network tab of the inspector
or use Selenium.

In this case, we are lucky:

The file we want is the one whose name starts with

world_impact_rankings

We can get its URL as follows:

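Open the web page in your browser (the screenshots are from Chrome, but other browsers work the same way)
Right-click anywhere on the page and open "Inspect". A panel should pop up, probably at the bottom of the browser window
In that new panel (the inspector) there should be a "Network" tab at the top, as in my screenshot. Open it.
In my experience you always have to reload the page before the Network tab starts listing requests, so reload the page. You should now see a number of requests being recorded.
Use the filter search box at the top left to narrow the requests down to JSON files only
Find the request you are looking for, then right-click > Copy > Copy link address (in Firefox it may say Copy URL instead, which is the same thing)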

Now that we have the URL, we can load it into R with jsonlite:

# (install if necessary, and) load jsonlite
pacman::p_load(jsonlite)


# it's best for you to do this yourself, as 1. the link is likely going to change into the future, and 2. it's best if I don't spoonfeed you everything :P
link <- "insert your link here"

# load the data
data <- fromJSON(link)

names(data) # [1] "data"      "subjects"  "locations" "pillars"

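The JSON file contains four separate sets of data: "subjects", which looks like a named list of subjects and their corresponding codes; "locations" (some kind of ID for each country); "pillars" (possibly data for an internal model); and "data", which is everything else and, I assume, what you are looking for.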
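To see how jsonlite treats a file of this shape without hitting the network, here is a small sketch using a toy JSON string. Only the four top-level key names come from the real file; every value inside `toy_json` is made up for illustration.

```r
library(jsonlite)

# Toy JSON with the same four top-level keys as the real file.
# All values are hypothetical placeholders.
toy_json <- '{
  "data": [
    {"rank": "1", "name": "University A"},
    {"rank": "2", "name": "University B"}
  ],
  "subjects": {"s1": "Subject one"},
  "locations": {"l1": "Country one"},
  "pillars": {"p1": "Pillar one"}
}'

toy <- fromJSON(toy_json)

names(toy)       # the four top-level elements
class(toy$data)  # an array of records is simplified to a data frame
str(toy, max.level = 1)
```

Running `str(data, max.level = 1)` on the real object is a quick way to decide which of the four elements to keep.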

df <- data$data |> dplyr::as_tibble() # I used as_tibble() here because R's default way of displaying a dataframe is to print the entire thing, which is a bit of a mess when it has 1591 rows and 22 columns, as in the example below. R knows not to print every row nor every column of a tibble, but converting it to a tibble() isn't strictly necessary
df

Output:

# A tibble: 1,591 × 22
   rank_order rank  name          scores_overall scores_overall_rank record_type
   <chr>      <chr> <chr>         <chr>          <chr>               <chr>      
 1 1          1     Western Sydn… 99.4           1                   master_acc…
 2 2          2     University o… 97.5           2                   master_acc…
 3 3          3     Queen’s Univ… 97.2           3                   master_acc…
 4 4          4     Universiti S… 96.9           4                   master_acc…
 5 5          5     University o… 96.6           5                   master_acc…
 6 6          6     Arizona Stat… 96.5           6                   public     
 7 7          =7    University o… 96.4           7                   master_acc…
 8 8          =7    RMIT Univers… 96.4           8                   master_acc…
 9 9          =9    Aalborg Univ… 95.8           9                   master_acc…
10 10         =9    University o… 95.8           10                  master_acc…
# ℹ 1,581 more rows
# ℹ 16 more variables: member_level <chr>, url <chr>, nid <int>,
#   location <chr>, stats_number_students <chr>,
#   stats_student_staff_ratio <chr>, stats_pc_intl_students <chr>,
#   stats_female_male_ratio <chr>, aliases <chr>, subjects_offered <chr>,
#   best_scores <chr>, closed <lgl>, unaccredited <lgl>, disabled <lgl>,
#   apply_link <chr>, cta_button <df[,2]>
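From here, getting closer to the tidy table in the question is mostly a matter of picking columns and fixing types. The sketch below uses a two-row stand-in with column names taken from the printed output; the university and country values are placeholders, and the per-SDG score columns are not visible in that output, so they are left out rather than guessed.

```r
library(dplyr)

# Stand-in for df: column names match the output above, values are placeholders
df_demo <- tibble(
  rank = c("1", "=7"),
  name = c("University A", "University B"),
  scores_overall = c("99.4", "96.4"),
  location = c("Country A", "Country B")
)

tidy <- df_demo |>
  # keep only the columns of interest
  select(rank, name, scores_overall, location) |>
  # scores arrive as character; convert to numeric for sorting and filtering
  mutate(scores_overall = as.numeric(scores_overall))

tidy
```

`rank` is kept as character because tied ranks appear as values like "=7". If some real rows hold something other than a plain number in `scores_overall`, `as.numeric()` would turn those into `NA` with a warning, so it is worth checking for that on the full data.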
