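I am trying to scrape a single table from the following URL: https://baseballsavant.mlb.com/league?season=2023#statcastHitting. However, my attempts either scrape multiple tables from the wider page or return a 0x0 tibble. I only want the "Statcast Hitting" table near the top of the page, without the spanner headers. I used Selector Gadget to try to pinpoint the correct node, but I suspect I am not referencing the HTML correctly. The image below shows the table I am trying to scrape and convert to a data frame.
I have tried several times; a couple of the failed attempts are below.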
library(rvest)
library(tidyverse)
url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'
savant_teams <- url %>% read_html %>% html_node('#statcastHitting') %>%
html_table()
savant_teams
library(tidyverse)
library(rvest)
url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'
savant_teams <- url %>% read_html %>% html_node('#statcast_th-8 .tablesorter-header-inner , #statcast_th-7 , #statcast_th-6 .tablesorter-header-inner , #statcast_th-5 .tablesorter-header-inner , #statcast_th-10 .tablesorter-header-inner , #statcast_th-4 .tablesorter-header-inner , #statcast_th-2 .tablesorter-header-inner , #statcast_th-9 , #statcast_th-1 .tablesorter-header-inner , #statcast_th-3 .tablesorter-header-inner , .tablesorterb23e763259572 #statcast_th-0 .tablesorter-header-inner , #scg_ span') %>%
html_table()
savant_teams
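The tables are wrapped in a div with class table-savant. Since the table you want to scrape is the first one, you can use the selector #statcastHitting div.table-savant. html_node() returns only the first match, so this picks the first such div and gives you only the table inside it: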
library(rvest)
library(tidyverse)
url <- 'https://baseballsavant.mlb.com/league?season=2023#statcastHitting'
savant_teams <- url %>%
read_html() %>%
html_node('#statcastHitting div.table-savant') %>%
html_table()
savant_teams
#> # A tibble: 32 × 27
#> `` `` `Standard Stats` `Standard Stats` `Standard Stats`
#> <chr> <chr> <chr> <chr> <chr>
#> 1 "Team" Season PA AB H
#> 2 "" 2023 6,249 5,597 1,543
#> 3 "" 2023 5,985 5,428 1,325
#> 4 "" 2023 6,207 5,541 1,417
#> 5 "" 2023 5,980 5,501 1,308
#> 6 "" 2023 6,253 5,567 1,441
#> 7 "" 2023 6,219 5,489 1,336
#> 8 "" 2023 5,966 5,311 1,187
#> 9 "" 2023 6,164 5,511 1,432
#> 10 "" 2023 6,180 5,401 1,316
#> # ℹ 22 more rows
#> # ℹ 22 more variables: `Standard Stats` <chr>, `Standard Stats` <chr>,
#> # `Standard Stats` <chr>, `Standard Stats` <chr>, `Standard Stats` <chr>,
#> # `Standard Stats` <chr>, `Standard Stats` <chr>, `Standard Stats` <chr>,
#> # `Standard Stats` <chr>, `Standard Stats` <chr>, Statcast <chr>,
#> # Statcast <chr>, Statcast <chr>, Statcast <chr>, Statcast <chr>,
#> # Statcast <chr>, Statcast <chr>, Statcast <chr>, Statcast <chr>, …
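
The result above still uses the spanner header ("Standard Stats", "Statcast") as column names, while the real column names ("Team", "Season", "PA", ...) sit in the first data row. Below is a minimal sketch of one way to drop the spanner row. It assumes the table has exactly two header rows, as the output suggests. The inline HTML is a hypothetical miniature and not the real page markup; only the figures are copied from the output above.

```r
library(rvest)

# Hypothetical miniature of the page: a spanner row followed by the real names.
# The markup is an assumption; the figures are copied from the output above.
page <- minimal_html('
  <div id="statcastHitting">
    <div class="table-savant">
      <table>
        <tr><th colspan="2"></th><th colspan="3">Standard Stats</th></tr>
        <tr><th>Team</th><th>Season</th><th>PA</th><th>AB</th><th>H</th></tr>
        <tr><td></td><td>2023</td><td>6,249</td><td>5,597</td><td>1,543</td></tr>
        <tr><td></td><td>2023</td><td>5,985</td><td>5,428</td><td>1,325</td></tr>
      </table>
    </div>
  </div>')

# header = FALSE keeps both header rows as data, so columns are named X1, X2, ...
raw <- page %>%
  html_element("#statcastHitting div.table-savant table") %>%
  html_table(header = FALSE)

# The second row holds the real column names.
col_names <- unlist(raw[2, ], use.names = FALSE)

# Drop the spanner row and the name row, then apply the real names.
savant_teams <- raw[-(1:2), ]
names(savant_teams) <- col_names

# Strip thousands separators and convert the count columns to numbers.
num_cols <- c("Season", "PA", "AB", "H")
savant_teams[num_cols] <- lapply(
  savant_teams[num_cols],
  function(x) as.numeric(gsub(",", "", x))
)

savant_teams
```

On the live page you would replace the minimal_html() call with read_html(url) and extend num_cols to the remaining stat columns. Whether the same selector and row positions hold there is something to check against the actual result.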