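I'm trying to write code to scrape the shooting accuracy table from the basketball site cleaningtheglass.com. I tried to find a CSS selector to extract the table, but I must be doing something wrong, because I keep getting nothing back.
Here is my code: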
library(rvest)
library(tidyverse)
url <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"
# Read the HTML content of the webpage
webpage <- url %>%
  read_html()
# Use the specific CSS selector for the table
table_data <- webpage %>%
  html_nodes('#shooting_overall > div.stat_table_container')
What am I doing wrong?
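This is a somewhat interesting case. The reason the selector doesn't work was already given in the comments: this is a JavaScript-driven site, and what you see through a CSS selector or in the browser's dev tools inspector differs considerably from the actual page source that is available to rvest. The table data is embedded in the site's source and delivered in a single response, which in itself is not unusual. However, the dataset is about 13MB and all of it is squeezed into a single <script> .. </script> element; apparently rvest::html_text() is unable to extract the whole content and returns a truncated string.
So instead of rvest we can use httr(2) to load the page content and process it as lines of text. Once the relevant JS function call is located, we can extract the function arguments (JavaScript arrays and objects), each of which sits on its own line. As long as those strings are clean enough (e.g. no trailing commas; extra whitespace is fine), we can parse them as JSON strings.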
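To see the line-based idea in isolation, here is a minimal offline sketch. The lines in html_l are made up for illustration (the real page is far larger and passes four arguments), and the anchor text is taken from the code that follows; only base R and jsonlite are used.

```r
library(jsonlite)

# Hypothetical stand-in for the page source, already split into lines;
# the real arguments are much larger but follow the same one-per-line layout
html_l <- c(
  "<script>",
  "  window.vuePlayerFilter = vuePlayers(",
  '    [{"name": "Player A", "efg_perc": 0.5}, {"name": "Player B", "efg_perc": 0.6}],',
  '    {"shooting_overall": ["efg_perc"]}',
  "  );",
  "</script>"
)

# Locate the line that opens the function call, ignoring indentation
idx_anchor <- which(trimws(html_l) == "window.vuePlayerFilter = vuePlayers(")

# In this toy input the two arguments sit on the next two lines
args <- html_l[(idx_anchor + 1):(idx_anchor + 2)]

# Strip surrounding whitespace and the trailing comma that separates JS arguments
args <- sub(",$", "", trimws(args))
names(args) <- c("allPlayerData", "statCategoryMappings")

# Each remaining string is valid JSON and can be parsed on its own
parsed <- lapply(args, fromJSON)

parsed$allPlayerData
```

An array of objects comes back as a data frame, which is why the player data can later be passed straight to as_tibble().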
library(httr2)
library(readr)
library(dplyr)
url_ <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"
# read html as lines
html_l <-
  request(url_) |>
  req_perform() |>
  resp_body_string() |>
  read_lines()
# locate the target js assignment and get the relevant vuePlayers() function argument values
# (trimws() so that indentation in the page source does not affect the match)
idx_anchor <- which(trimws(html_l) == "window.vuePlayerFilter = vuePlayers(")
players <- html_l[(idx_anchor+1):(idx_anchor+4)]
# names from js function arguments
names(players) <- c("allPlayerData", "onOffTeamData", "onOffOpponentData", "statCategoryMappings")
# check start & end for anything that might cause issues for jsonlite
tibble(arg = names(players),
       start = sapply(players, stringr::str_trunc, 20, side = "right"),
       end = sapply(players, stringr::str_trunc, 20, side = "left"))
# remove trailing commas
players <- sapply(players, \(x) gsub(",$", "", x))
# parse all function arguments as JSONs
players <- lapply(players, jsonlite::fromJSON)
# looks like we have a view config for default table
players$statCategoryMappings[[2]] |> str()
#> List of 2
#> $ : chr "shooting_overall"
#> $ :'data.frame': 8 obs. of 4 variables:
#> ..$ abbr: chr [1:8] "efg_perc" "fg2_perc" "fg3_perc" "ft_perc" ...
#> ..$ type: chr [1:8] "percent1" "percent1" "percent1" "percent1" ...
#> ..$ name: chr [1:8] "eFG%" "2P%" "3P%" "FT%" ...
#> ..$ sort: int [1:8] NA NA NA NA 0 0 0 0
# named column name vector for select
stat_map <-
  players$statCategoryMappings[[2]][[2]][, c("name", "abbr")] |>
  mutate(name = gsub("<br />", " ", name, fixed = TRUE)) |>
  tibble::deframe()
stat_map
#> eFG% 2P% 3P% FT%
#> "efg_perc" "fg2_perc" "fg3_perc" "ft_perc"
#> ASTD% All ASTD% Rim ASTD% Mid ASTD% Three
#> "astd_perc" "astd_rim_perc" "astd_nr2_perc" "astd_three_perc"
# allPlayerData, select a subset that matches with the site's default table
# (total number of columns is 111)
players$allPlayerData %>%
  as_tibble() %>%
  select(name, age, team = team_name, pos = pos_category,
         sec_played = seconds_played, all_of(stat_map))
Result:
#> # A tibble: 306 × 13
#> name age team pos sec_played `eFG%` `2P%` `3P%` `FT%` `ASTD% All`
#> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Precious … 24.1 TOR big 10311 0.509 0.548 0.267 0.778 0.815
#> 2 Bam Adeba… 26.3 MIA big 26814 0.540 0.538 0.5 0.825 0.556
#> 3 Ochai Agb… 23.5 UTA wing 16443 0.542 0.5 0.382 0.667 0.846
#> 4 Santi Ald… 22.8 MEM big 12095 0.543 0.519 0.372 0.6 0.814
#> 5 Nickeil A… 25.2 MIN wing 16807 0.565 0.6 0.364 0.333 0.774
#> 6 Grayson A… 28.1 PHX wing 28936 0.627 0.5 0.474 0.864 0.705
#> 7 Jarrett A… 25.5 CLE big 16405 0.622 0.622 NA 0.756 0.804
#> 8 Kyle Ande… 30.1 MIN forw… 20847 0.579 0.612 0.222 0.581 0.535
#> 9 Giannis A… 28.9 MIL big 27676 0.615 0.647 0.222 0.625 0.45
#> 10 Cole Anth… 23.5 ORL point 22301 0.5 0.478 0.351 0.848 0.484
#> # ℹ 296 more rows
#> # ℹ 3 more variables: `ASTD% Rim` <dbl>, `ASTD% Mid` <dbl>, `ASTD% Three` <dbl>
Created on 2023-11-25 with reprex v2.0.2
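The select(..., all_of(stat_map)) step relies on a named character vector: the values pick the columns and the names become the new column names. A small sketch with made-up data (the mapping and df tibbles below are hypothetical stand-ins, not values from the site) shows the mechanism on its own:

```r
library(dplyr)
library(tibble)

# Hypothetical mapping table, shaped like the name/abbr columns used above
mapping <- tibble(name = c("eFG%", "3P%"), abbr = c("efg_perc", "fg3_perc"))

# deframe() turns a two-column table into a named vector:
# names come from the first column, values from the second
stat_map <- deframe(mapping)

# Made-up player data that uses the abbreviated column names
df <- tibble(name = c("Player A", "Player B"),
             efg_perc = c(0.5, 0.6),
             fg3_perc = c(0.35, 0.4),
             unused = c(1, 2))

# all_of() with a named vector selects the listed columns and renames them
out <- select(df, name, all_of(stat_map))
names(out)
```

Because all_of() is strict, a missing column raises an error rather than being dropped silently, which is useful if the site ever changes its field names.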