Using rvest to scrape a table after a specific HTML heading

Question (votes: 0, answers: 1)

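I'm building a scraper for API documentation and want to extract one specific table from each page of the docs.

In this case, I want the tables that follow the H4 heading "Parameters". See below.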

library(rvest)
#> Loading required package: xml2
library(magrittr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)

page <- read_html("http://developer.cbssports.com/documentation/api/files/league-stats") 

# List of Tables in Page.  The tables I want are 3:6
page %>% 
  html_table() 
#> [[1]]
#>   X1 X2      X3
#> 1 NA NA Sign-In
#> 
#> [[2]]
#>                                               X1
#> 1                                   League Stats
#> 2                                       Resource
#> 3                                   League Stats
#> 4                                     Change Log
#> 5                               January 15, 2013
#> 6                              February 26, 2015
#> 7                                   HTTP Methods
#> 8                                            GET
#> 9                                     Exceptions
#> 10 Following list of exceptions will be returned
#> 11                                       Example
#> 12                         Retrieve League Stats
#> 13                         XML field definitions
#> 14                                 /league_stats
#> 15                         /league_stats/players
#> 16                  /league_stats/players/player
#> 17              /league_stats/players/player/@id
#> 18             /league_stats/players/player/stat
#>                                                                                                                                   X2
#> 1                                                                                                                                   
#> 2                                                                                                                                   
#> 3            Provides the statistics for a fantasy league by timeframe, period, player status, team id, team totals, and stats type.
#> 4                                                                                                                                   
#> 5                                                                                                                                   
#> 6                                                                                                                                   
#> 7                                                                                                                                   
#> 8  Retrieve stats for a league for a specified timeframe, period, team id, team totals, pro or fantasy, actual stats or projections.
#> 9                                                                                                                                   
#> 10                                                                                                                                  
#> 11                                                                                                                                  
#> 12                                                                                                                                  
#> 13                                                                                                                                  
#> 14                                                                                         A container element for the league stats.
#> 15                                                             A container element for the list of players whose stats are returned.
#> 16                                                                                              A containter element for the player.
#> 17                                                                                                                   ID of a player.
#> 18                                                                               Definition: /league_stats/players/player/stat/@abbr
#> 
#> [[3]]
#>                X1
#> 1 response_format
#> 2           limit
#> 3          offset
#> 4       player_id
#> 5   player_status
#> 6        position
#> 7       timeframe
#> 8          period
#>                                                                                                                                                                                                                                                        X2
#> 1                                                                                                                (optional) Specifies the format in which the requested resource should be returned.  Valid values are XML and JSON.  The default is XML.
#> 2                                                                                                                                                    (optional) Specifies the number of players to be returned.  By default, the API returns all players.
#> 3                                                                                                                                                                     (optional, defaults to 0) Specifies the offset for the first player to be returned.
#> 4                      (optional, multiple allowed and can be passed as a comma-delimited list, can not be used with player_status, position, team_id, team_type, pro_or_fantasy params) Limit the result set to players with ids specified in this param
#> 5                                                                                                                                                      (optional, defaults to free_agents) Limit the result set by player_status (‘free_agents’ or ‘all’)
#> 6 (optional, defaults to all batters, multiple allowed and can be passed as a comma-delimited list) Limits the result set to a particular position, which is specified by a position code.  For a list of position codes, request the Positions resource.
#> 7                                                                         (optional, defaults to current year) Year for which stats are being requested (any year in YYYY format from 1997 to current year).  Only supported period for past years is ytd
#> 8                                                                                                                                                                                 (optional, defaults to ytd) Period for which stats are being requested.
#> 
#> [[4]]
#>           X1
#> 1 stats_type
#>                                                                                                                                                                                    X2
#> 1 (optional, defaults to ‘stats’) Determines if the resource will include actual stats, projections or Football Red Zone Stats.  Possible values are (stats, projections or redzone).
#> 
#> [[5]]
#>       X1
#> 1 source
#>                                                                                                               X2
#> 1 (optional, defaults to ‘cbs’ for stats_type=projections) The source for which projections are being requested.
#> 
#> [[6]]
#>               X1
#> 1        team_id
#> 2      team_type
#> 3 pro_or_fantasy
#>                                                                                                                                                                                                                                                                                                                                                                                      X2
#> 1                                                                                                                                                          (required if requesting stats for a particular team or team totals) ID of the fantasy team whose stats are being requested (use all to get team totals).  To get a list of fantasy team codes, request the <Teams> resource.
#> 2                                                                                                                                                                                                        (optional, defaults to ‘roster’) Determine whether to get a team’s roster or their scout team (‘roster’, ‘active’, ‘reserve’ or ‘scout_team’).  Can only be used with team_id.
#> 3 (optional, defaults to ‘fantasy’) Determines whether to count the stats for players only when they were active on the fantasy team’s roster or even when they were on the bench (‘fantasy’ or ‘pro’).  Can only be used with team_id.  fantasy is only supported for current year and periods ytd, y, period_number, or Xd (Last X Days), team_type active, and for stats_type stats.

# List of H4 in Page
page %>% 
  html_nodes("h4") # I want the 3rd value "Parameters"
#> {xml_nodeset (13)}
#>  [1] <h4 class="CHeading">Description</h4>\n
#>  [2] <h4 class="CHeading">URL</h4>\n
#>  [3] <h4 class="CHeading">Parameters</h4>\n
#>  [4] <h4 class="CHeading">Sample URL</h4>\n
#>  [5] <h4 class="CHeading">XML Response</h4>\n
#>  [6] <h4 class="CHeading">Notes</h4>\n
#>  [7] <h4 class="CHeading">JSON Response</h4>\n
#>  [8] <h4 class="CHeading">Notes</h4>\n
#>  [9] <h4 class="CHeading">Sample URL</h4>\n
#> [10] <h4 class="CHeading">XML Response</h4>\n
#> [11] <h4 class="CHeading">Notes</h4>\n
#> [12] <h4 class="CHeading">JSON Response</h4>\n
#> [13] <h4 class="CHeading">Notes</h4>\n


# Manual way to get information I'm looking for
parameters <- page %>% 
  html_table() %>% 
  .[3:6] %>% 
  dplyr::bind_rows()

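How can I get only the tables that come after "Parameters"? Is there a more direct way to do this, without having to look up the table indices by hand for every page?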

Tags: r, rvest
1 answer
Votes: 0

How about targeting the tables more precisely with a CSS selector passed to html_nodes()?

library(rvest)
#> Loading required package: xml2
library(magrittr, warn.conflicts = FALSE)
library(purrr, warn.conflicts = FALSE)

page <- read_html("http://developer.cbssports.com/documentation/api/files/league-stats") 

page %>% 
  html_nodes("h4")
#> {xml_nodeset (13)}
#>  [1] <h4 class="CHeading">Description</h4>\n
#>  [2] <h4 class="CHeading">URL</h4>\n
#>  [3] <h4 class="CHeading">Parameters</h4>\n
#>  [4] <h4 class="CHeading">Sample URL</h4>\n
#>  [5] <h4 class="CHeading">XML Response</h4>\n
#>  [6] <h4 class="CHeading">Notes</h4>\n
#>  [7] <h4 class="CHeading">JSON Response</h4>\n
#>  [8] <h4 class="CHeading">Notes</h4>\n
#>  [9] <h4 class="CHeading">Sample URL</h4>\n
#> [10] <h4 class="CHeading">XML Response</h4>\n
#> [11] <h4 class="CHeading">Notes</h4>\n
#> [12] <h4 class="CHeading">JSON Response</h4>\n
#> [13] <h4 class="CHeading">Notes</h4>\n

# Select the wanted tables directly with CSS selectors (these match tables 3 and 6 of the full list)
out = page %>% 
  html_nodes("#Content > div.CHTTPMethods > div > div > table:nth-child(5)") %>% 
  html_table()

table_1 = out[[1]]
out = page %>% 
  html_nodes("#Content > div.CHTTPMethods > div > div > table:nth-child(18)") %>% 
  html_table()
table_2 = out[[1]]

table_1
#>                X1
#> 1 response_format
#> 2           limit
#> 3          offset
#> 4       player_id
#> 5   player_status
#> 6        position
#> 7       timeframe
#> 8          period
#>                                                                                                                                                                                                                                                        X2
#> 1                                                                                                                (optional) Specifies the format in which the requested resource should be returned.  Valid values are XML and JSON.  The default is XML.
#> 2                                                                                                                                                    (optional) Specifies the number of players to be returned.  By default, the API returns all players.
#> 3                                                                                                                                                                     (optional, defaults to 0) Specifies the offset for the first player to be returned.
#> 4                      (optional, multiple allowed and can be passed as a comma-delimited list, can not be used with player_status, position, team_id, team_type, pro_or_fantasy params) Limit the result set to players with ids specified in this param
#> 5                                                                                                                                                      (optional, defaults to free_agents) Limit the result set by player_status (‘free_agents’ or ‘all’)
#> 6 (optional, defaults to all batters, multiple allowed and can be passed as a comma-delimited list) Limits the result set to a particular position, which is specified by a position code.  For a list of position codes, request the Positions resource.
#> 7                                                                         (optional, defaults to current year) Year for which stats are being requested (any year in YYYY format from 1997 to current year).  Only supported period for past years is ytd
#> 8                                                                                                                                                                                 (optional, defaults to ytd) Period for which stats are being requested.
table_2
#>               X1
#> 1        team_id
#> 2      team_type
#> 3 pro_or_fantasy
#>                                                                                                                                                                                                                                                                                                                                                                                      X2
#> 1                                                                                                                                                          (required if requesting stats for a particular team or team totals) ID of the fantasy team whose stats are being requested (use all to get team totals).  To get a list of fantasy team codes, request the <Teams> resource.
#> 2                                                                                                                                                                                                        (optional, defaults to ‘roster’) Determine whether to get a team’s roster or their scout team (‘roster’, ‘active’, ‘reserve’ or ‘scout_team’).  Can only be used with team_id.
#> 3 (optional, defaults to ‘fantasy’) Determines whether to count the stats for players only when they were active on the fantasy team’s roster or even when they were on the bench (‘fantasy’ or ‘pro’).  Can only be used with team_id.  fantasy is only supported for current year and periods ytd, y, period_number, or Xd (Last X Days), team_type active, and for stats_type stats.

Created on 2020-05-05 by the reprex package (v0.3.0)
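
The nth-child selectors above still depend on where each table sits in the page. A possible alternative, sketched below, is to anchor on the heading text itself with XPath: take every table that follows the "Parameters" h4 and whose nearest preceding h4 is that same heading. This is only a sketch under assumptions: the inline HTML is a hypothetical, simplified stand-in for the documentation page, the helper name tables_after_heading is made up for illustration, and the approach assumes the h4 headings and the tables are siblings inside one container, which has not been verified against the real page.

```r
library(rvest)

# Hypothetical, simplified stand-in for the documentation page.
# Assumption: headings and tables are siblings in the same container.
html <- '
<div id="Content">
  <h4 class="CHeading">URL</h4>
  <table><tr><td>url</td><td>not wanted</td></tr></table>
  <h4 class="CHeading">Parameters</h4>
  <p>Some explanatory text</p>
  <table><tr><td>limit</td><td>number of players</td></tr></table>
  <table><tr><td>team_id</td><td>fantasy team id</td></tr></table>
  <h4 class="CHeading">Sample URL</h4>
  <table><tr><td>sample</td><td>not wanted</td></tr></table>
</div>'

page <- read_html(html)

# Hypothetical helper: return all tables located between the h4 whose text
# equals `heading` and the next h4 on the same level.
tables_after_heading <- function(page, heading) {
  xpath <- sprintf(
    paste0(
      "//h4[normalize-space(.) = '%s']",
      "/following-sibling::table",
      # keep a table only if its closest preceding h4 is the same heading
      "[preceding-sibling::h4[1][normalize-space(.) = '%s']]"
    ),
    heading, heading
  )
  page %>%
    html_nodes(xpath = xpath) %>%
    html_table()
}

tables <- tables_after_heading(page, "Parameters")

# Stack the individual tables into one data frame
parameters <- do.call(rbind, tables)
parameters
```

Because the selection is driven by the heading text rather than by an index, the same call could be reused across pages where the number of tables before "Parameters" differs, as long as the sibling assumption holds for those pages.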
