Web scraping ESPN box score data in R

Question (votes: 0, answers: 1)

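I'm very new to web scraping, and I'm trying to extract all the information from the box score of certain NHL games. For example, for the game with id 401459058, that would be all the information in this table (url: https://www.espn.com/nhl/boxscore/_/gameId/401044673):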

I tried the following:

library(RSelenium)
library(netstat)
library(wdman)
library(rvest)
library(xml2)
library(dplyr)

url = 'https://www.espn.com/nhl/boxscore/_/gameId/401044673'

rD = rsDriver(browser='firefox', chromever='114.0.5735.90', port = free_port()) #specify chrome version

remDr = rD[['client']]
remDr$open()
remDr$navigate(url)
src = remDr$getPageSource()[[1]] 

df = read_html(src) %>% 
  html_elements(xpath = "//tr[@class = 'Table__TR Table__TR--sm Table__even']//text()") %>% 
  html_text

The result is a single character vector. If I use:

as.data.frame(matrix(unlist(df),nrow=length(df),byrow=TRUE))

I get all the values in a single column, like this:

2                 2
3                 4
4                 2
5                 8
6               STL
7                 2
8                 0
9                 2
10                4
11          Skaters
12        K. Connor
13                 
14               LW
15        N. Ehlers
16   

But I don't know how to get these values into a data frame that looks like the table on the website, which is the expected output.

r selenium-webdriver web-scraping rvest
1 Answer (votes: 1)

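The tables are included in the page content, so for this exact task there is no need for {RSelenium}; {rvest} alone should be enough. Note, though, that there are 4 tables per team.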
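To illustrate the idea without hitting the site, here is a minimal sketch on made-up inline HTML. The markup below is a hypothetical simplification (one `div.Wrapper` holding a names table and a matching stats table); the real ESPN markup is more complex. It only shows that `html_table()` parses static tables directly and that a table pair can be bound column-wise:

```r
library(rvest)
library(dplyr)

# Hypothetical, simplified markup: one team section with a names table
# and a matching stats table (not the real ESPN page structure).
html <- minimal_html('
  <div class="Wrapper">
    <table>
      <tr><td>Skaters</td></tr>
      <tr><td>A. Player</td></tr>
      <tr><td>B. Player</td></tr>
    </table>
    <table>
      <tr><td>G</td><td>A</td></tr>
      <tr><td>1</td><td>0</td></tr>
      <tr><td>0</td><td>2</td></tr>
    </table>
  </div>')

# Parse every table in the section into a list of tibbles
tables <- html %>%
  html_elements("div.Wrapper table") %>%
  html_table()

# Bind the pair side by side; "minimal" name repair tolerates the
# duplicated default column names (X1, X1, X2)
paired <- bind_cols(tables, .name_repair = "minimal")
```

Because neither table uses `<th>` cells, the column labels end up in the first data row, which is why a later step has to promote that row to column names.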

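The following processes the team sections (2), binds the 2 table pairs within each section, and returns a list of tables, 2 for each of the two teams: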

library(rvest)
library(dplyr)
library(purrr)
library(tidyr)

url_ <- "https://www.espn.com/nhl/boxscore/_/gameId/401459058"

read_html(url_) %>% 
  # extract team sections (2)
  html_elements("div.Boxscore div.Wrapper") %>% 
  # extract team names, use as list element names
  set_names(html_elements(., ".BoxscoreItem__TeamName") %>% html_text()) %>% 
  # extract table elements, 4 per team
  map(\(team_section) html_elements(team_section, "table")) %>% 
  map(\(team_tables) list(
    # bind tables 1 & 2 (skaters/defensemen and data section)
    tbl_1 = html_table(team_tables[1:2]) %>% 
      bind_cols(.name_repair = "minimal") %>% 
      # column names from the first row
      set_names(.[1,]) %>% 
      rename(player = Skaters) %>% 
      # position to separate column
      mutate(position = if_else(G == "G", player, NA), .before = 1) %>% 
      fill(position, .direction = "down") %>% 
      # remove rows with header info
      filter(G != "G"),
    # bind tables 3 & 4 (goalies and data section)
    tbl_2 = html_table(team_tables[3:4]) %>% 
      bind_cols(.name_repair = "minimal") %>% 
      set_names(.[1,]) %>% 
      filter(SA != "SA")
    )
  ) 
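
The reshaping steps used for `tbl_1` (promote the first row to column names, turn the group header rows into a `position` column, then drop those header rows) can be tried in isolation on toy data. This is only a sketch: the tibble `raw` and the player names are made up, and base `setNames()` is swapped in for `set_names()` to avoid needing {purrr}:

```r
library(dplyr)
library(tidyr)

# Made-up rows mimicking the bound tables: each group starts with a header
# row that repeats the column labels, with the group name in the first column.
raw <- tibble(
  X1 = c("Skaters", "A. Player", "defensemen", "B. Player"),
  X2 = c("G", "1", "G", "0"),
  X3 = c("A", "0", "A", "2")
)

tidy <- raw %>%
  # promote the first row to column names
  setNames(as.character(unlist(.[1, ]))) %>%
  rename(player = Skaters) %>%
  # on header rows (G == "G") the player column holds the group label
  mutate(position = if_else(G == "G", player, NA_character_), .before = 1) %>%
  # carry each label down to the rows beneath it
  fill(position, .direction = "down") %>%
  # drop the header rows themselves
  filter(G != "G")
```

Here `tidy` keeps one row per player, with `position` filled from the nearest header row above it.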

Result:

#> $`Los Angeles Kings`
#> $`Los Angeles Kings`$tbl_1
#> # A tibble: 18 × 21
#>    position   player G     A     `+/-` S     SM    BS    PN    PIM   HT    TK   
#>    <chr>      <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 Skaters    J. An… 0     0     -1    1     2     0     0     0     2     0    
#>  2 Skaters    P. Da… 0     0     0     1     1     2     1     2     0     2    
#>  3 Skaters    K. Fi… 0     2     1     4     1     0     0     0     0     0    
#>  4 Skaters    C. Gr… 0     0     0     0     0     0     1     2     0     0    
#>  5 Skaters    A. Ia… 0     0     0     3     0     0     0     0     0     1    
#>  6 Skaters    A. Ka… 0     0     0     1     1     0     0     0     0     0    
#>  7 Skaters    A. Ke… 2     0     0     3     2     0     0     0     1     0    
#>  8 Skaters    A. Ko… 0     1     1     3     2     1     0     0     1     3    
#>  9 Skaters    R. Ku… 0     0     0     0     0     0     0     0     1     0    
#> 10 Skaters    B. Li… 0     0     0     3     0     0     1     2     0     1    
#> 11 Skaters    T. Mo… 0     0     0     6     3     1     1     2     0     1    
#> 12 Skaters    G. Vi… 0     0     -1    0     1     0     0     0     0     1    
#> 13 defensemen M. An… 0     0     0     1     0     2     0     0     1     0    
#> 14 defensemen D. Do… 0     1     0     1     1     1     2     4     0     1    
#> 15 defensemen S. Du… 0     0     -1    0     1     2     1     2     0     1    
#> 16 defensemen A. Ed… 0     0     1     2     1     6     0     0     4     0    
#> 17 defensemen M. Ro… 0     0     -1    1     0     2     0     0     3     1    
#> 18 defensemen S. Wa… 0     0     1     0     0     3     0     0     0     0    
#> # ℹ 9 more variables: GV <chr>, SHFT <chr>, TOI <chr>, PPTOI <chr>,
#> #   SHTOI <chr>, ESTOI <chr>, FW <chr>, FL <chr>, `FO%` <chr>
#> 
#> $`Los Angeles Kings`$tbl_2
#> # A tibble: 1 × 12
#>   goalies     SA    GA    SV    `SV%` ESSV  PPSV  SHSV  SOSA  SOS   TOI   PIM  
#>   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 P. Copley G 35    2     33    .943  17    12    4     0     0     64:48 0    
#> 
#> 
#> $`Boston Bruins`
#> $`Boston Bruins`$tbl_1
#> # A tibble: 18 × 21
#>    position   player G     A     `+/-` S     SM    BS    PN    PIM   HT    TK   
#>    <chr>      <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#>  1 Skaters    P. Be… 0     0     1     6     0     2     0     0     0     0    
#>  2 Skaters    C. Co… 0     1     1     0     0     0     0     0     1     2    
#>  3 Skaters    J. De… 0     0     0     2     1     1     0     0     0     0    
#>  4 Skaters    N. Fo… 0     0     0     0     0     2     0     0     1     1    
#>  5 Skaters    T. Fr… 0     0     1     0     2     0     0     0     4     0    
#>  6 Skaters    A.J. … 0     0     0     0     0     2     0     0     3     1    
#>  7 Skaters    T. Ha… 1     0     1     4     2     0     0     0     0     0    
#>  8 Skaters    D. Kr… 0     0     -1    1     0     0     0     0     1     1    
#>  9 Skaters    B. Ma… 1     0     0     2     1     0     3     6     2     1    
#> 10 Skaters    T. No… 0     0     0     2     1     0     0     0     0     0    
#> 11 Skaters    D. Pa… 0     1     -1    5     4     0     0     0     0     1    
#> 12 Skaters    P. Za… 0     0     -1    3     0     0     0     0     1     0    
#> 13 defensemen B. Ca… 0     0     1     0     2     3     1     2     1     0    
#> 14 defensemen C. Cl… 0     0     0     1     0     1     1     2     5     2    
#> 15 defensemen D. Fo… 0     0     -1    3     0     1     0     0     2     0    
#> 16 defensemen M. Gr… 0     1     1     3     0     2     0     0     0     1    
#> 17 defensemen H. Li… 0     0     0     1     0     1     0     0     0     0    
#> 18 defensemen C. Mc… 0     1     -1    2     1     1     1     4     1     1    
#> # ℹ 9 more variables: GV <chr>, SHFT <chr>, TOI <chr>, PPTOI <chr>,
#> #   SHTOI <chr>, ESTOI <chr>, FW <chr>, FL <chr>, `FO%` <chr>
#> 
#> $`Boston Bruins`$tbl_2
#> # A tibble: 1 × 12
#>   goalies      SA    GA    SV    `SV%` ESSV  PPSV  SHSV  SOSA  SOS   TOI   PIM  
#>   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 L. Ullmark G 30    2     28    .933  23    3     2     0     0     64:43 0

Created on 2023-10-12 with reprex v2.0.2
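
Every column in the result is `<chr>`, since the header rows were part of the data when the tables were parsed. If numeric columns are needed, one possible follow-up is base R's `type.convert()`. The sketch below runs on a made-up stand-in for one `tbl_1` element; `toi_seconds()` is a hypothetical helper for the "mm:ss" time-on-ice strings, not part of any package:

```r
library(dplyr)

# Made-up stand-in for one tbl_1 element; real values come from the scrape.
tbl <- tibble(
  position = c("Skaters", "defensemen"),
  player   = c("A. Player", "B. Player"),
  G        = c("1", "0"),
  `+/-`    = c("-1", "1"),
  TOI      = c("18:32", "21:05")
)

# Convert columns that look numeric; as.is = TRUE keeps the rest as character
typed <- type.convert(tbl, as.is = TRUE)

# Hypothetical helper: "mm:ss" strings to seconds
toi_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]), numeric(1))
}

typed <- typed %>% mutate(TOI_sec = toi_seconds(TOI))
```

Columns such as `TOI` stay as character after `type.convert()` because "18:32" is not a number, hence the separate helper.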
