How do I identify which element to scrape when using read_html?


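I'm trying to extract data from the main table at the URL below, but read_html instead pulls data from the tables hidden in the drop-down menus at the top of the page.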

I tried specifying the table number, but that just gave me the headers without any actual data:

library(rvest)
url <- "https://www.fangraphs.com/leaders/major-league?pos=all&stats=pit&lg=all&season=2024&season1=2024&ind=0&qual=1&pageitems=2000000000&v_cr=202301&type=c%2C7%2C31%2C13%2C16%2C24%2C19%2C15%2C212&month=0"
page <- read_html(url)
player_stats <- html_table(html_nodes(page, "table"))[[10]]
player_stats
# A tibble: 0 × 11
# … with 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>, RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>, RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>
# ℹ Use `colnames()` to see all variable names

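I think the real solution is to use XPath to specify exactly where on the page I'm trying to pull from, but I'm not quite sure how to do that. I opened the developer tools but didn't see a table ID, and I'm wondering whether there is some other element I can use.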

r web-scraping rvest
1 Answer

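Picking the [[10]]th element is somewhat arbitrary. Although you may see nine tables ahead of it on the rendered page, the order in which you see them is not always the order in the HTML, and there are often embedded components that are not visible (for a number of reasons). I'd suggest that [[10]] is not the best choice.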

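While it is usually best to select a table by its ID or some other unambiguous attribute, sometimes we have to fall back on known properties of the table, such as its column names.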
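
As a minimal sketch of both approaches, the snippet below selects tables from a small made-up page (built with rvest's minimal_html; the table contents and the name page_toy are illustrative assumptions, not the real site). It assumes header cells carry a data-stat attribute, as the th elements in the node listing further down do.

```r
library(rvest)

# Toy page (made up for illustration): one menu table with a class,
# and one stats table with no id or class of its own.
page_toy <- minimal_html('
  <table class="menu-team-table"><tbody><tr><td>AL East</td></tr></tbody></table>
  <table>
    <thead><tr><th data-stat="Name">Name</th><th data-stat="Team">Team</th></tr></thead>
    <tbody><tr><td>Pitcher A</td><td>CLE</td></tr></tbody>
  </table>
')

# CSS selector: pick a table by an explicit attribute (here, its class).
menu <- html_elements(page_toy, "table.menu-team-table")

# XPath: pick any table containing a header cell whose data-stat is "Name".
stats_node <- html_elements(page_toy, xpath = "//table[.//th[@data-stat='Name']]")
stats <- html_table(stats_node[[1]])
```

Selecting the right node this way does not by itself guarantee that the node contains any rows, so it is still worth inspecting the result.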

Looking at all of the tables:

page_static <- read_html(url)
html_nodes(page_static, "table")
# {xml_nodeset (10)}
#  [1] <table class="menu-scores-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL Games</div>\n<table><tbody></tbody></table>\n</td>\n<td class="menu-ta ...
#  [2] <table><tbody></tbody></table>
#  [3] <table><tbody></tbody></table>
#  [4] <table class="menu-standings-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<div id="menu-standings-ale"><div></div></div>\n<div cl ...
#  [5] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/teams/blue-jays">Blue Jays</a ...
#  [6] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/roster-resource/depth-charts/ ...
#  [7] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">BAL</td>\n<td><a href="https://blogs.fangraphs.com/chicago-white-sox-top-27-prospects">CH ...
#  [8] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">ATL</td>\n<td><a class="team-box_state__prelim__jxGiG" href="https://blogs.fangraphs.com/ ...
#  [9] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left  fixed ">Name</th>\n<th dat ...
# [10] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left  fixed ">Name</th>\n<th dat ...

Nothing stands out... let's explore what each table looks like:

html_nodes(page_static, "table") |>
  lapply(html_table)
# [[1]]
# # A tibble: 1 × 3
#   X1       X2    X3      
#   <chr>    <lgl> <chr>   
# 1 AL Games NA    NL Games
# [[2]]
# # A tibble: 0 × 0
# [[3]]
# # A tibble: 0 × 0
# [[4]]
# # A tibble: 1 × 3
#   X1                       X2    X3                      
#   <chr>                    <lgl> <chr>                   
# 1 AL EastAL CentralAL West NA    NL EastNL CentralNL West
# [[5]]
# # A tibble: 2 × 3
#   X1                                                                           X2                                                                            X3   
#   <chr>                                                                        <chr>                                                                         <chr>
# 1 AL EastBlue Jays  |  DCOrioles  |  DCRays  |  DCRed Sox  |  DCYankees  |  DC AL CentralGuardians  |  DCRoyals  |  DCTigers  |  DCTwins  |  DCWhite Sox  |… AL W…
# 2 NL EastBraves  |  DCMarlins  |  DCMets  |  DCNationals  |  DCPhillies  |  DC NL CentralBrewers  |  DCCardinals  |  DCCubs  |  DCPirates  |  DCReds  |  DC  NL W…
# [[6]]
# # A tibble: 2 × 3
#   X1                                        X2                                            X3                                         
#   <chr>                                     <chr>                                         <chr>                                      
# 1 AL EastBlue JaysOriolesRaysRed SoxYankees AL CentralGuardiansRoyalsTigersTwinsWhite Sox AL WestAngelsAstrosAthleticsMarinersRangers
# 2 NL EastBravesMarlinsMetsNationalsPhillies NL CentralBrewersCardinalsCubsPiratesReds     NL WestD-backsDodgersGiantsPadresRockies   
# [[7]]
# # A tibble: 5 × 3
#   X1    X2    X3   
#   <chr> <chr> <chr>
# 1 BAL   CHW   LAA  
# 2 BOS   CLE   OAK  
# 3 NYY   DET   SEA  
# 4 TBR   KCR   TEX  
# 5 TOR   MIN   HOU  
# [[8]]
# # A tibble: 5 × 3
#   X1    X2    X3   
#   <chr> <chr> <chr>
# 1 ATL   CHC*  ARI  
# 2 MIA   CIN   COL  
# 3 WSN   MIL   LAD  
# 4 NYM*  PIT   SDP* 
# 5 PHI   STL   SFG  
# [[9]]
# # A tibble: 0 × 11
# # ℹ 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>,
# #   RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>,
# #   RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>
# [[10]]
# # A tibble: 0 × 11
# # ℹ 11 variables: # <lgl>, Name <lgl>, Team <lgl>, GG - Games Pitched <lgl>, PitchesPitches - Pitches Thrown <lgl>, IPIP - Innings Pitched <lgl>,
# #   RR - Runs Allowed <lgl>, SOSO - Strikeouts <lgl>, BBBB - Walks <lgl>, HH - Hits Allowed <lgl>,
# #   RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed <lgl>

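Yes, I see the "10th" table you were hoping for (or perhaps the 9th?), but it is clearly empty. First, I don't see anything like id= that would clearly distinguish these two tables from the others. Second, the fact that it is empty suggests that the data is filled in by JavaScript.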
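
One quick way to check for this, sketched here on a made-up page (page_toy and its two tables are illustrative assumptions), is to count the body rows of every table. A table with headers but zero body rows in the static HTML is a hint that the rows are added later by a script.

```r
library(rvest)

# Toy page (made up): one populated table and one header-only shell.
page_toy <- minimal_html('
  <table><thead><tr><th>Name</th></tr></thead>
    <tbody><tr><td>Pitcher A</td></tr></tbody></table>
  <table><thead><tr><th>Name</th></tr></thead><tbody></tbody></table>
')

tables <- html_elements(page_toy, "table")

# Number of <tr> rows inside <tbody> for each table; 0 means header only.
n_rows <- vapply(tables, \(tb) length(html_elements(tb, "tbody tr")), integer(1))
n_rows
```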

Fortunately, rvest_1.0.4 added rvest::read_html_live, which uses headless Chrome (it may require some OS components?) and lets the JavaScript components run so that you can collect the data you need.

I'll use it, and suggest a way to pick out the table you want dynamically.

page_dynamic <- read_html_live(url)
html_nodes(page_dynamic, "table")
# {xml_nodeset (16)}
#  [1] <table class="menu-scores-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL Games</div>\n<table><tbody>\n<tr>\n<td><a href="https://www.fangraphs. ...
#  [2] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/livewins.aspx?date=2024-04-05&amp;team=Yankees&amp;dh=0&amp;season=2024">TOR (3) @ NYY (0)</a ...
#  [3] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/livewins.aspx?date=2024-04-05&amp;team=Cubs&amp;dh=0&amp;season=2024">LAD (7) @ CHC (9)</a></ ...
#  [4] <table class="menu-standings-table"><tbody><tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<div id="menu-standings-ale"><table><tbody>\n<tr>\n<td> ...
#  [5] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
#  [6] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
#  [7] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
#  [8] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
#  [9] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
# [10] <table><tbody>\n<tr>\n<td><a href="https://www.fangraphs.com/leaders.aspx?pos=all&amp;stats=bat&amp;lg=all&amp;qual=0&amp;type=8&amp;season=2024&amp;month ...
# [11] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/teams/blue-jays">Blue Jays</a ...
# [12] <table class="menu-team-table"><tbody>\n<tr>\n<td>\n<div class="menu-sub-header">AL East</div>\n<a href="//www.fangraphs.com/roster-resource/depth-charts/ ...
# [13] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">BAL</td>\n<td><a href="https://blogs.fangraphs.com/chicago-white-sox-top-27-prospects">CH ...
# [14] <table><tbody>\n<tr>\n<td class="team-box_link__inactive__24Fnd">ATL</td>\n<td><a class="team-box_state__prelim__jxGiG" href="https://blogs.fangraphs.com/ ...
# [15] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left  fixed ">Name</th>\n<th dat ...
# [16] <table>\n<thead><tr>\n<th class="th-rank fixed">#</th>\n<th data-col="0" data-col-id="Name" data-stat="Name" class="align-left  fixed ">Name</th>\n<th dat ...

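Okay, that's a lot more tables. If we apply the |> lapply(html_table) trick used above, we see that this time the data is actually populated. Looking through these tables, only two of them have the column names "Name" and "Team" (among others), so let's filter on that to retrieve the data.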

my_tables <- html_nodes(page_dynamic, "table") |>
  lapply(html_table) |>
  Filter(\(tb) "Name" %in% names(tb), x = _)
my_tables
# [[1]]
# # A tibble: 409 × 11
#      `#` Name  Team  `GG - Games Pitched` PitchesPitches - Pit…¹ IPIP - Innings Pitch…² `RR - Runs Allowed` `SOSO - Strikeouts` `BBBB - Walks` `HH - Hits Allowed`
#    <int> <chr> <chr>                <int>                  <int>                  <dbl>               <int>               <int>          <int>               <int>
#  1     1 Shan… CLE                      2                    166                   12                     0                  20              1                  10
#  2     2 Brad… KCR                      2                    171                   13.1                   1                  14              2                   5
#  3     3 Cris… HOU                      2                    187                   11                     0                   9              6                   5
#  4     4 Fran… CIN                      2                    181                   11.2                   1                   9              3                   9
#  5     5 Seth… KCR                      2                    172                   12.2                   1                   7              3                  10
#  6     6 Rone… HOU                      1                    106                    9                     0                   7              2                   0
#  7     7 Zac … ARI                      2                    186                   11                     1                   9              5                   6
#  8     8 Merr… ARI                      2                    170                   13.2                   3                  12              1                   8
#  9     9 Jord… SFG                      2                    172                   12                     2                  11              1                   8
# 10    10 Garr… CHW                      2                    180                   13                     2                  16              1                   8
# # ℹ 399 more rows
# # ℹ abbreviated names: ¹​`PitchesPitches - Pitches Thrown`, ²​`IPIP - Innings Pitched`
# # ℹ 1 more variable: `RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed` <dbl>
# # ℹ Use `print(n = ...)` to see more rows
# [[2]]
# # A tibble: 409 × 11
#      `#` Name  Team  `GG - Games Pitched` PitchesPitches - Pit…¹ IPIP - Innings Pitch…² `RR - Runs Allowed` `SOSO - Strikeouts` `BBBB - Walks` `HH - Hits Allowed`
#    <int> <chr> <chr>                <int>                  <int>                  <dbl>               <int>               <int>          <int>               <int>
#  1     1 Shan… CLE                      2                    166                   12                     0                  20              1                  10
#  2     2 Brad… KCR                      2                    171                   13.1                   1                  14              2                   5
#  3     3 Cris… HOU                      2                    187                   11                     0                   9              6                   5
#  4     4 Fran… CIN                      2                    181                   11.2                   1                   9              3                   9
#  5     5 Seth… KCR                      2                    172                   12.2                   1                   7              3                  10
#  6     6 Rone… HOU                      1                    106                    9                     0                   7              2                   0
#  7     7 Zac … ARI                      2                    186                   11                     1                   9              5                   6
#  8     8 Merr… ARI                      2                    170                   13.2                   3                  12              1                   8
#  9     9 Jord… SFG                      2                    172                   12                     2                  11              1                   8
# 10    10 Garr… CHW                      2                    180                   13                     2                  16              1                   8
# # ℹ 399 more rows
# # ℹ abbreviated names: ¹​`PitchesPitches - Pitches Thrown`, ²​`IPIP - Innings Pitched`
# # ℹ 1 more variable: `RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed` <dbl>
# # ℹ Use `print(n = ...)` to see more rows

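In this case the two tables happen to be identical (which seems odd... why are there two identical tables in the HTML? *shrug*), but in other situations they might differ. I'll leave that exploration to you.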
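
As a starting point for that exploration, here is a rough sketch that drops exact duplicates and tidies the column names. It runs on a toy stand-in for my_tables with made-up values. The helper clean_header is hypothetical, and it assumes the long headers are a short label repeated twice and followed by " - " and a description, which is only what the printed names above appear to show.

```r
# Toy stand-in for my_tables (values are made up, not real statistics).
tbl <- data.frame(
  `#` = 1:2,
  Name = c("Pitcher A", "Pitcher B"),
  `GG - Games Pitched` = c(2L, 1L),
  `RA9-WARRA9-WAR - Wins Above Replacement calculated using Runs Allowed` = c(0.5, 0.3),
  check.names = FALSE
)
my_tables_toy <- list(tbl, tbl)

# Drop exact duplicates; duplicated() compares list elements as whole objects.
distinct_tables <- my_tables_toy[!duplicated(my_tables_toy)]

# Hypothetical helper: turn "GG - Games Pitched" into "G" when the part
# before " - " is a label repeated twice; otherwise leave the name alone.
clean_header <- function(x) {
  has_tip <- grepl(" - ", x, fixed = TRUE)
  doubled <- sub(" - .*$", "", x)
  half <- substr(doubled, 1, nchar(doubled) %/% 2)
  is_doubled <- has_tip & nchar(doubled) %% 2 == 0 & paste0(half, half) == doubled
  ifelse(is_doubled, half, x)
}

stats <- distinct_tables[[1]]
names(stats) <- clean_header(names(stats))
names(stats)
```

If the two real tables ever differ, the duplicated() step keeps both, and you would need to decide which one to use by inspecting them.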
