I'm trying to pull data from the main table at this URL, but read_html instead pulls data from the tables hidden in the dropdown menus at the top of the page.
I tried specifying the table number, but that only gave me the headers, without any actual data:
library(rvest)
url <- "https://www.fangraphs.com/leaders/major-league?pos=all&stats=pit&lg=all&season=2024&season1=2024&ind=0&qual=1&pageitems=2000000000&v_cr=202301&type=c%2C7%2C31%2C13%2C16%2C24%2C19%2C15%2C212&month=0"
page <- read_html(url)
player_stats <- html_table(html_nodes(page, "table"))[[10]]
player_stats
# A tibble: 0 × 11
# … with 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>, RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>, RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>
# ℹ Use `colnames()` to see all variable names
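I think the real solution is to use XPath to specify exactly where on the page I want to pull from, but I'm not sure how to do that. I opened the developer tools but didn't see a table ID, and I'm wondering whether there is some other element I could use.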
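Selecting element [[10]] is somewhat arbitrary. You might count 9 tables before it on the rendered page, but the order you see is not always the order in the HTML, and there are often embedded components that are not visible (for various reasons). I'd suggest that [[10]] is not the best choice.
While it's generally best to select a table by ID or some other unambiguous attribute, sometimes we have to fall back on known properties of the table, such as its column names.
Looking at all the tables: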
page_static <- read_html(url)
html_nodes(page_static, "table")
# {xml_nodeset (10)}
# [1] <table class="menu-scores-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL Games</div>\n<table><tbody></tbody></table>\n</td>\n<td class="menu-ta ...
# [2] <table><tbody></tbody></table>
# [3] <table><tbody></tbody></table>
# [4] <table class="menu-standings-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<div id="menu-standings-ale"><div></div></div>\n<div cl ...
# [5] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/teams/blue-jays">Blue Jays</a ...
# [6] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/roster-resource/depth-charts/ ...
# [7] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">BAL</td>\n<td><a href="https://blogs.fangraphs.com/chicago-white-sox-top-27-prospects">CH ...
# [8] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">ATL</td>\n<td><a class="team-box_state__prelim__jxGiG" href="https://blogs.fangraphs.com/ ...
# [9] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left fixed ">Name</th>\n<th dat ...
# [10] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left fixed ">Name</th>\n<th dat ...
Nothing stands out... let's see what each table looks like:
html_nodes(page_static, "table") |>
lapply(html_table)
# [[1]]
# # A tibble: 1 × 3
# X1 X2 X3
# <chr> <lgl> <chr>
# 1 AL Games NA NL Games
# [[2]]
# # A tibble: 0 × 0
# [[3]]
# # A tibble: 0 × 0
# [[4]]
# # A tibble: 1 × 3
# X1 X2 X3
# <chr> <lgl> <chr>
# 1 AL EastAL CentralAL West NA NL EastNL CentralNL West
# [[5]]
# # A tibble: 2 × 3
# X1 X2 X3
# <chr> <chr> <chr>
# 1 AL EastBlue Jays | DCOrioles | DCRays | DCRed Sox | DCYankees | DC AL CentralGuardians | DCRoyals | DCTigers | DCTwins | DCWhite Sox |… AL W…
# 2 NL EastBraves | DCMarlins | DCMets | DCNationals | DCPhillies | DC NL CentralBrewers | DCCardinals | DCCubs | DCPirates | DCReds | DC NL W…
# [[6]]
# # A tibble: 2 × 3
# X1 X2 X3
# <chr> <chr> <chr>
# 1 AL EastBlue JaysOriolesRaysRed SoxYankees AL CentralGuardiansRoyalsTigersTwinsWhite Sox AL WestAngelsAstrosAthleticsMarinersRangers
# 2 NL EastBravesMarlinsMetsNationalsPhillies NL CentralBrewersCardinalsCubsPiratesReds NL WestD-backsDodgersGiantsPadresRockies
# [[7]]
# # A tibble: 5 × 3
# X1 X2 X3
# <chr> <chr> <chr>
# 1 BAL CHW LAA
# 2 BOS CLE OAK
# 3 NYY DET SEA
# 4 TBR KCR TEX
# 5 TOR MIN HOU
# [[8]]
# # A tibble: 5 × 3
# X1 X2 X3
# <chr> <chr> <chr>
# 1 ATL CHC* ARI
# 2 MIA CIN COL
# 3 WSN MIL LAD
# 4 NYM* PIT SDP*
# 5 PHI STL SFG
# [[9]]
# # A tibble: 0 × 11
# # ℹ 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>,
# # RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>,
# # RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>
# [[10]]
# # A tibble: 0 × 11
# # ℹ 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>,
# # RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>,
# # RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>
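Yes, I see the "10th" table you were hoping for (or perhaps the 9th?), but it is clearly empty. First, I don't see anything like id= that would clearly distinguish these two tables from the others. Second, the fact that it is empty suggests the data is filled in by JavaScript.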
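A quick way to confirm the "header but no rows" pattern across every table at once is to tabulate each table's dimensions. The sketch below runs on a small hypothetical HTML snippet (built with rvest's minimal_html) rather than the real page, so the table contents are stand-ins:

```r
library(rvest)

# Hypothetical stand-in for a page whose data table is filled in by JavaScript:
# the header row is present in the static HTML, the body rows are not.
static_html <- minimal_html('
  <table><tbody><tr><td>BAL</td><td>CHW</td></tr></tbody></table>
  <table>
    <thead><tr><th>Name</th><th>Team</th></tr></thead>
    <tbody></tbody>
  </table>
')

tables <- html_nodes(static_html, "table") |>
  lapply(html_table)

# Rows and columns per table. A table with columns but zero rows hints that
# its body is rendered client-side.
shape <- data.frame(
  rows = vapply(tables, nrow, integer(1)),
  cols = vapply(tables, ncol, integer(1))
)
shape$header_only <- shape$rows == 0 & shape$cols > 0
shape
```

Running the same three steps on page_static should flag the two stats tables, since they have 11 columns and no rows.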
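Fortunately, rvest_1.0.4 added rvest::read_html_live, which drives headless Chrome so that the page's JavaScript can run and you can collect the data it fills in. It relies on the chromote package and a local Chrome installation, so some OS-level setup may be needed.
I'll use that and suggest a way to pick out the table you want dynamically.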
page_dynamic <- read_html_live(url)
html_nodes(page_dynamic, "table")
# {xml_nodeset (16)}
# [1] <table class="menu-scores-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL Games</div>\n<table><tbody>\n<tr>\n<td><a href="https://www.fangraphs. ...
# [2] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/livewins.aspx?date=2024-04-05&team=Yankees&dh=0&season=2024">TOR (3) @ NYY (0)</a ...
# [3] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/livewins.aspx?date=2024-04-05&team=Cubs&dh=0&season=2024">LAD (7) @ CHC (9)</a></ ...
# [4] <table class="menu-standings-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<div id="menu-standings-ale"><table><tbody>\n<tr>\n<td> ...
# [5] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [6] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [7] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [8] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [9] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [10] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season=2024&month ...
# [11] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/teams/blue-jays">Blue Jays</a ...
# [12] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/roster-resource/depth-charts/ ...
# [13] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">BAL</td>\n<td><a href="https://blogs.fangraphs.com/chicago-white-sox-top-27-prospects">CH ...
# [14] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">ATL</td>\n<td><a class="team-box_state__prelim__jxGiG" href="https://blogs.fangraphs.com/ ...
# [15] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left fixed ">Name</th>\n<th dat ...
# [16] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left fixed ">Name</th>\n<th dat ...
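OK, there are a lot more tables now. If we apply the |> lapply(html_table) trick used above, we see that this time the data is actually populated. Looking through these tables, only two of them have the column names "Name" and "Team" (among others), so let's filter on that to retrieve the data.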
my_tables <- html_nodes(page_dynamic, "table") |>
lapply(html_table) |>
Filter(\(tb) "Name" %in% names(tb), x = _)
my_tables
# [[1]]
# # A tibble: 409 × 11
# `#` Name Team `GG - Games Pitched` PitchesPitches - Pit…¹ IPIP - Innings Pitch…² `RR - Runs Allowed` `SOSO - Strikeouts` `BBBB - Walks` `HH - Hits Allowed`
# <int> <chr> <chr> <int> <int> <dbl> <int> <int> <int> <int>
# 1 1 Shan… CLE 2 166 12 0 20 1 10
# 2 2 Brad… KCR 2 171 13.1 1 14 2 5
# 3 3 Cris… HOU 2 187 11 0 9 6 5
# 4 4 Fran… CIN 2 181 11.2 1 9 3 9
# 5 5 Seth… KCR 2 172 12.2 1 7 3 10
# 6 6 Rone… HOU 1 106 9 0 7 2 0
# 7 7 Zac … ARI 2 186 11 1 9 5 6
# 8 8 Merr… ARI 2 170 13.2 3 12 1 8
# 9 9 Jord… SFG 2 172 12 2 11 1 8
# 10 10 Garr… CHW 2 180 13 2 16 1 8
# # ℹ 399 more rows
# # ℹ abbreviated names: ¹`PitchesPitches - Pitches Thrown`, ²`IPIP - Innings Pitched`
# # ℹ 1 more variable: `RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed` <dbl>
# # ℹ Use `print(n = ...)` to see more rows
# [[2]]
# # A tibble: 409 × 11
# `#` Name Team `GG - Games Pitched` PitchesPitches - Pit…¹ IPIP - Innings Pitch…² `RR - Runs Allowed` `SOSO - Strikeouts` `BBBB - Walks` `HH - Hits Allowed`
# <int> <chr> <chr> <int> <int> <dbl> <int> <int> <int> <int>
# 1 1 Shan… CLE 2 166 12 0 20 1 10
# 2 2 Brad… KCR 2 171 13.1 1 14 2 5
# 3 3 Cris… HOU 2 187 11 0 9 6 5
# 4 4 Fran… CIN 2 181 11.2 1 9 3 9
# 5 5 Seth… KCR 2 172 12.2 1 7 3 10
# 6 6 Rone… HOU 1 106 9 0 7 2 0
# 7 7 Zac … ARI 2 186 11 1 9 5 6
# 8 8 Merr… ARI 2 170 13.2 3 12 1 8
# 9 9 Jord… SFG 2 172 12 2 11 1 8
# 10 10 Garr… CHW 2 180 13 2 16 1 8
# # ℹ 399 more rows
# # ℹ abbreviated names: ¹`PitchesPitches - Pitches Thrown`, ²`IPIP - Innings Pitched`
# # ℹ 1 more variable: `RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed` <dbl>
# # ℹ Use `print(n = ...)` to see more rows
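The column headers above come out doubled (for example "GG - Games Pitched"), most likely because each header cell holds both a short label and a tooltip description. Assuming the pattern is always "label repeated twice, then ' - ' and a description", a small base-R helper can recover the short labels. shorten_names is a hypothetical name, not part of rvest:

```r
# Hypothetical helper: turn "GG - Games Pitched" into "G".
# Names that do not fit the assumed pattern are returned unchanged.
shorten_names <- function(nms) {
  has_desc <- grepl(" - ", nms, fixed = TRUE)
  prefix   <- sub(" - .*$", "", nms)                 # text before the first " - "
  half     <- substr(prefix, 1, nchar(prefix) %/% 2) # first half of that text
  doubled  <- has_desc & paste0(half, half) == prefix
  ifelse(doubled, half, nms)
}

example_names <- c(
  "#", "Name", "GG - Games Pitched",
  "RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed"
)
short <- shorten_names(example_names)
short
```

It could then be applied with something like names(my_tables[[1]]) <- shorten_names(names(my_tables[[1]])).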
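In this case the two tables happen to be identical (which seems odd... why are there two identical tables in the HTML? shrug), but in other cases they may differ. I'll leave that exploration to you.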
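Since the question asked about XPath: the node listing shows that the header cells of the stats tables carry a data-stat="Name" attribute, so that can serve as the selector instead of a table ID. The sketch below uses a trimmed-down, hypothetical copy of the page (the player row and the data-stat="Team" attribute are made up for illustration) and also checks whether the matching tables are identical before keeping just one:

```r
library(rvest)

# Hypothetical stand-in for the rendered page: one menu table and two copies
# of the stats table, whose header cells carry a data-stat attribute.
page <- minimal_html('
  <table class="menu-team-table"><tbody><tr><td>Blue Jays</td></tr></tbody></table>
  <table>
    <thead><tr><th data-stat="Name">Name</th><th data-stat="Team">Team</th></tr></thead>
    <tbody><tr><td>Pitcher A</td><td>CLE</td></tr></tbody>
  </table>
  <table>
    <thead><tr><th data-stat="Name">Name</th><th data-stat="Team">Team</th></tr></thead>
    <tbody><tr><td>Pitcher A</td><td>CLE</td></tr></tbody>
  </table>
')

# Keep only tables that contain a header cell with data-stat="Name".
stat_tables <- page |>
  html_nodes(xpath = "//table[.//th[@data-stat='Name']]") |>
  lapply(html_table)

# If every copy matches the first one, keep a single data frame;
# otherwise keep the whole list for manual inspection.
all_same <- all(vapply(stat_tables, identical, logical(1), y = stat_tables[[1]]))
player_stats <- if (all_same) stat_tables[[1]] else stat_tables
player_stats
```

On the real site the same XPath would be run against page_dynamic. Note that it would also match any outer table that wraps a stats table, so it is worth checking the number of matches.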