Problem scraping a table from Cleaning the Glass


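I'm trying to write code that lets me scrape the shooting accuracy table from the basketball site cleaningtheglass.com. I tried to find a CSS selector to extract the table, but I must be doing something wrong, because I keep getting nothing back.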

Here is my code:

library(rvest)
library(tidyverse)

url <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"

# Read the HTML content of the webpage
webpage <- url %>%
  read_html()
  

# Use the specific CSS selector for the table
table_data <- webpage %>%
  html_nodes('#shooting_overall > div.stat_table_container') 

What am I doing wrong?

r web-scraping rvest
1 Answer

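This is a somewhat interesting case. The reason the selector doesn't work was already explained in the comments: this is a JavaScript-driven site, and what you see in a CSS selector tool or in the browser's dev tools inspector is quite different from the actual page source that is available to rvest. The table data is nevertheless embedded in the site's source and delivered in a single response, which by itself is not unusual. But that dataset is about 13MB in size and all of it is packed into a single <script> .. </script> element; apparently rvest::html_text() cannot extract all of it and returns a truncated string.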
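To make the first point concrete, here is a minimal offline sketch. The HTML string is a made-up stand-in (an assumption, not the real site's markup): the container the selector targets is empty in the static source, while the data sits as text inside a script element.

```r
library(rvest)

# Toy page (hypothetical markup): the table container is empty in the
# static source; the data lives inside a <script> element instead.
html_src <- '
<html><body>
  <div id="shooting_overall"></div>
  <script>
    window.demo = render(
      [{"name": "A", "efg_perc": 0.5}]
    );
  </script>
</body></html>'

page <- read_html(html_src)

# the selector from the question matches nothing in the static source
nodes <- html_elements(page, "#shooting_overall > div.stat_table_container")
length(nodes)
#> [1] 0

# ... while the data is present as plain text in the script element
script_txt <- page |> html_element("script") |> html_text()
grepl("efg_perc", script_txt, fixed = TRUE)
#> [1] TRUE
```

An empty node set like this is usually a hint to look at the raw page source (or the network tab) rather than at the rendered DOM.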

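So instead of rvest we can use httr(2) to load the page content and process it as lines of text; once the relevant js function call is located, we can extract the function arguments (JavaScript arrays and objects), each of which sits on its own line. As long as those strings are clean enough (e.g. no trailing commas; extra whitespace is fine), we can parse those objects as JSON strings.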
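The approach can be tried offline on a toy input first. In this sketch the source lines, the function name demoPlayers() and the argument names are all made up for illustration; only the pattern (find the anchor line, take the lines after it, strip trailing commas, parse as JSON) mirrors the description above.

```r
library(jsonlite)

# Toy stand-in for the page source (hypothetical): a js call whose
# arguments each sit on their own line, separated by trailing commas.
src <- c(
  "<script>",
  "        window.demoFilter = demoPlayers(",
  '            [{"name": "A", "efg_perc": 0.51}, {"name": "B", "efg_perc": 0.62}],',
  '            {"shooting_overall": ["efg_perc"]}',
  "        );",
  "</script>"
)

# locate the call; trimws() makes the match independent of indentation
idx_anchor <- which(trimws(src) == "window.demoFilter = demoPlayers(")

# the two arguments follow on the next two lines
args <- src[(idx_anchor + 1):(idx_anchor + 2)]
names(args) <- c("playerData", "mappings")

# drop the trailing comma left over from the js argument list
args <- sub(",\\s*$", "", args)

# each remaining string is valid JSON and can be parsed on its own
parsed <- lapply(args, fromJSON)
parsed$playerData
```

Matching on the trimmed line is a small design choice: an exact comparison against the indented string works too, but breaks as soon as the site changes its indentation.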

library(httr2)
library(readr)
library(dplyr)

url_ <- "https://cleaningtheglass.com/stats/players?stat_category=shooting_overall#/"

# read html as lines
html_l <- 
  request(url_) |>
  req_perform() |>
  resp_body_string() |>
  read_lines()

# locate target js assignment and get relevant vuePlayers() function argument values
idx_anchor <- which(html_l == "        window.vuePlayerFilter = vuePlayers(")
players <- html_l[(idx_anchor+1):(idx_anchor+4)]

# names from js function arguments
names(players) <- c("allPlayerData", "onOffTeamData", "onOffOpponentData", "statCategoryMappings")

# check start & end for anything that might cause issues for jsonlite
# (str_trunc() comes from stringr, which is not attached above)
tibble(arg   = names(players), 
       start = sapply(players, stringr::str_trunc, 20, side = "right"),
       end   = sapply(players, stringr::str_trunc, 20, side = "left"))

# remove trailing commas 
players <- sapply(players, \(x) gsub(",$", "", x))

# parse all function arguments as JSONs
players <- lapply(players, jsonlite::fromJSON)

# looks like we have a view config for default table 
players$statCategoryMappings[[2]] |> str()
#> List of 2
#>  $ : chr "shooting_overall"
#>  $ :'data.frame':    8 obs. of  4 variables:
#>   ..$ abbr: chr [1:8] "efg_perc" "fg2_perc" "fg3_perc" "ft_perc" ...
#>   ..$ type: chr [1:8] "percent1" "percent1" "percent1" "percent1" ...
#>   ..$ name: chr [1:8] "eFG%" "2P%" "3P%" "FT%" ...
#>   ..$ sort: int [1:8] NA NA NA NA 0 0 0 0

# named column name vector for select
stat_map <- 
  players$statCategoryMappings[[2]][[2]][,c("name", "abbr")] |> 
  mutate(name = gsub("<br />", " ", name, fixed = TRUE)) |>
  tibble::deframe()
stat_map
#>              eFG%               2P%               3P%               FT% 
#>        "efg_perc"        "fg2_perc"        "fg3_perc"         "ft_perc" 
#>         ASTD% All         ASTD% Rim         ASTD% Mid       ASTD% Three 
#>       "astd_perc"   "astd_rim_perc"   "astd_nr2_perc" "astd_three_perc"

# allPlayerData, select a subset that matches with the site's default table
# (total number of columns is 111)
players$allPlayerData %>% 
  as_tibble() %>% 
  select(name, age, team = team_name, pos = pos_category, 
         sec_played = seconds_played, all_of(stat_map)) 

Result:

#> # A tibble: 306 × 13
#>    name         age team  pos   sec_played `eFG%` `2P%`  `3P%` `FT%` `ASTD% All`
#>    <chr>      <dbl> <chr> <chr>      <dbl>  <dbl> <dbl>  <dbl> <dbl>       <dbl>
#>  1 Precious …  24.1 TOR   big        10311  0.509 0.548  0.267 0.778       0.815
#>  2 Bam Adeba…  26.3 MIA   big        26814  0.540 0.538  0.5   0.825       0.556
#>  3 Ochai Agb…  23.5 UTA   wing       16443  0.542 0.5    0.382 0.667       0.846
#>  4 Santi Ald…  22.8 MEM   big        12095  0.543 0.519  0.372 0.6         0.814
#>  5 Nickeil A…  25.2 MIN   wing       16807  0.565 0.6    0.364 0.333       0.774
#>  6 Grayson A…  28.1 PHX   wing       28936  0.627 0.5    0.474 0.864       0.705
#>  7 Jarrett A…  25.5 CLE   big        16405  0.622 0.622 NA     0.756       0.804
#>  8 Kyle Ande…  30.1 MIN   forw…      20847  0.579 0.612  0.222 0.581       0.535
#>  9 Giannis A…  28.9 MIL   big        27676  0.615 0.647  0.222 0.625       0.45 
#> 10 Cole Anth…  23.5 ORL   point      22301  0.5   0.478  0.351 0.848       0.484
#> # ℹ 296 more rows
#> # ℹ 3 more variables: `ASTD% Rim` <dbl>, `ASTD% Mid` <dbl>, `ASTD% Three` <dbl>

Created on 2023-11-25 with reprex v2.0.2
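The renaming in the final select() relies on a named character vector, which may not be obvious at first glance. A minimal sketch with made-up data (the mapping table and player rows here are hypothetical) shows the mechanism in isolation:

```r
library(dplyr)

# hypothetical mapping table, mirroring the name / abbr columns above
mapping <- tibble(name = c("eFG%", "3P%"), abbr = c("efg_perc", "fg3_perc"))

# deframe() turns a two-column frame into a named vector:
# names come from the first column, values from the second
stat_map <- tibble::deframe(mapping)

# made-up player rows with one column that is not in the mapping
toy <- tibble(name = c("A", "B"), efg_perc = c(0.51, 0.62),
              fg3_perc = c(0.35, NA), other = 1:2)

# all_of() with a named vector selects the columns given by the values
# and renames them to the vector's names
res <- select(toy, name, all_of(stat_map))
names(res)
#> [1] "name" "eFG%" "3P%"
```

Columns that are not listed, such as other here, are simply dropped, which is how the 111 source columns get reduced to the site's default view.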
