Cannot scrape all tables on a page with rvest when some are inside HTML comments

Question (votes: 0, answers: 1)

I am trying to scrape all of the tables from this page: https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml

I found that some of the tables sit inside comment tags, so, using code adapted from an earlier answer, I have:

library(magrittr)
library(rvest)
library(xml2)
library(stringi)

urlbbref <- read_html("https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml")
# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table

# Additional tables are within the comment tags, ie <!-- tables -->
# Which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref,"//comment()") %>% {
  #Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--","-->"),c("",""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and 
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i){
    rvest::html_table(xml_find_all(read_html(i), "//table")) %>% 
      .[[1]]
  })
}
# Put all the data frames into a list.
all_tables <- c(
  table_one, alt_tables
)

However, the second pitching table (Arizona) does not appear. I can get the first one with:

all_tables[9]

Output:

> all_tables[9]
[[1]]
# A tibble: 4 × 27
  Pitching         IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str  Ctct   StS
  <chr>         <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int>
1 Nathan Eoval…   6       4     0     0     5     5     0  2.95    27    97    60    40     7
2 Aroldis Chap…   0.2     0     0     0     1     1     0  2.25     3    10     4     2     1
3 Josh Sborz, …   2.1     1     0     0     0     4     0  0.75     8    31    20    11     1
4 Team Totals     9       5     0     0     6    10     0  0       38   138    84    53     9
# ℹ 13 more variables: StL <int>, GB <int>, FB <int>, LD <int>, Unk <int>, GSc <int>,
#   IR <int>, IS <int>, WPA <dbl>, aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>

But for some reason the second table does not appear, and I do not know why, or how to get it.

r rvest
1 Answer
0 votes

Something like this?

pacman::p_load(rvest, tidyverse)

path <- "https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml"

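path |>
  read_html() |>
  # Select only the comment nodes whose text contains "div class"
  html_nodes(xpath = '//comment()[contains(., "div class")]') |>
  # Strip the comment delimiters, re-parse, and extract every table in each comment
  map(\(x) x |>
        as.character() |>
        str_remove_all("<!--|-->") |>
        read_html() |>
        html_table()) |>
  # Flatten the per-comment lists into a single list of tibbles
  unlist(recursive = FALSE)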

Output:

[[1]]
# A tibble: 14 × 24
   Batting    AB     R     H   RBI    BB    SO    PA     BA    OBP    SLG    OPS
   <chr>   <int> <int> <int> <int> <int> <int> <int>  <dbl>  <dbl>  <dbl>  <dbl>
 1 "Marcu…     5     1     2     2     0     1     5  0.224  0.28   0.355  0.636
 2 "Corey…     4     1     2     0     1     0     5  0.318  0.451  0.682  1.13 
 3 "Evan …     5     0     1     0     0     4     5  0.3    0.417  0.5    0.917
 4 "Mitch…     4     0     1     1     0     1     4  0.226  0.317  0.434  0.751
 5 "Josh …     4     1     1     0     0     1     4  0.308  0.329  0.538  0.867
 6 "Natha…     3     1     1     0     1     0     4  0.212  0.278  0.379  0.657
 7 "Jonah…     4     1     1     1     0     1     4  0.212  0.268  0.348  0.616
 8 "Leody…     4     0     0     0     0     1     4  0.175  0.299  0.281  0.579
 9 "Travi…     3     0     0     0     1     0     4  0.333  0.4    0.444  0.844
10 ""         NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
11 "Natha…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
12 "Arold…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
13 "Josh …    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
14 "Team …    36     5     9     4     3     9    39  0.25   0.308  0.361  0.669
# ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#   A <int>, Details <chr>

[[2]]
# A tibble: 16 × 24
   Batting    AB     R     H   RBI    BB    SO    PA     BA    OBP    SLG    OPS
   <chr>   <int> <int> <int> <int> <int> <int> <int>  <dbl>  <dbl>  <dbl>  <dbl>
 1 "Corbi…     4     0     1     0     1     0     5  0.273  0.364  0.409  0.773
 2 "Ketel…     2     0     0     0     3     1     5  0.329  0.38   0.534  0.914
 3 "Gabri…     3     0     0     0     0     2     4  0.238  0.304  0.444  0.749
 4 "Chris…     3     0     1     0     1     1     4  0.217  0.36   0.35   0.71 
 5 "Tommy…     3     0     0     0     1     1     4  0.279  0.297  0.475  0.772
 6 "Lourd…     4     0     1     0     0     0     4  0.273  0.29   0.455  0.744
 7 "Alek …     4     0     1     0     0     0     4  0.222  0.271  0.463  0.734
 8 "Evan …     3     0     1     0     0     1     3  0.167  0.226  0.229  0.456
 9 "Pavin…     1     0     0     0     0     1     1  0.3    0.364  0.3    0.664
10 "Emman…     0     0     0     0     0     0     0  0.235  0.278  0.294  0.572
11 "Geral…     4     0     0     0     0     3     4  0.275  0.362  0.392  0.754
12 ""         NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
13 "Zac G…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
14 "Kevin…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
15 "Paul …    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
16 "Team …    31     0     5     0     6    10    38  0.161  0.297  0.194  0.491
# ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#   A <int>, Details <chr>

[[3]]
# A tibble: 4 × 27
  Pitching        IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str
  <chr>        <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
1 Nathan Eova…   6       4     0     0     5     5     0  2.95    27    97    60
2 Aroldis Cha…   0.2     0     0     0     1     1     0  2.25     3    10     4
3 Josh Sborz,…   2.1     1     0     0     0     4     0  0.75     8    31    20
4 Team Totals    9       5     0     0     6    10     0  0       38   138    84
# ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
#   LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
#   cWPA <chr>, acLI <dbl>, RE24 <dbl>

[[4]]
# A tibble: 4 × 27
  Pitching        IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str
  <chr>        <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
1 Zac Gallen,…   6.1     3     1     1     1     6     0  4.54    23    83    57
2 Kevin Ginkel   1.2     1     0     0     2     1     0  0        8    32    15
3 Paul Sewald    1       5     4     4     0     2     1  5.4      8    20    14
4 Team Totals    9       9     5     5     3     9     1  5       39   135    86
# ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
#   LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
#   cWPA <chr>, acLI <dbl>, RE24 <dbl>

[[5]]
# A tibble: 10 × 3
      X1 X2               X3   
   <int> <chr>            <chr>
 1     1 Marcus Semien    2B   
 2     2 Corey Seager     SS   
 3     3 Evan Carter      LF   
 4     4 Mitch Garver     DH   
 5     5 Josh Jung        3B   
 6     6 Nathaniel Lowe   1B   
 7     7 Jonah Heim       C    
 8     8 Leody Taveras    CF   
 9     9 Travis Jankowski RF   
10    NA Nathan Eovaldi   P    

[[6]]
# A tibble: 10 × 3
      X1 X2                  X3   
   <int> <chr>               <chr>
 1     1 Corbin Carroll      RF   
 2     2 Ketel Marte         2B   
 3     3 Gabriel Moreno      C    
 4     4 Christian Walker    1B   
 5     5 Tommy Pham          DH   
 6     6 Lourdes Gurriel Jr. LF   
 7     7 Alek Thomas         CF   
 8     8 Evan Longoria       3B   
 9     9 Geraldo Perdomo     SS   
10    NA Zac Gallen          P    

[[7]]
# A tibble: 5 × 12
  Inn   Score   Out RoB   `Pit(cnt)`     `R/O` `@Bat` Batter Pitcher wWPA  wWE  
  <chr> <chr> <int> <chr> <chr>          <chr> <chr>  <chr>  <chr>   <chr> <chr>
1 t7    0-0       0 1--   2,(1-0) BX     ""    TEX    Evan … Zac Ga… 17%   73%  
2 t7    0-0       0 -23   2,(0-1) FX     "R"   TEX    Mitch… Zac Ga… 10%   82%  
3 b5    0-0       2 123   1,(0-0) X      "O"   ARI    Lourd… Nathan… 9%    50%  
4 t9    1-0       0 12-   1,(0-0) X      "RR"  TEX    Jonah… Paul S… 9%    98%  
5 b3    0-0       1 -23   6,(2-2) B*BFF… "O"   ARI    Chris… Nathan… 8%    43%  
# ℹ 1 more variable: `Play Description` <chr>

[[8]]
# A tibble: 120 × 12
   Inn      Score Out   RoB   `Pit(cnt)` `R/O` `@Bat` Batter Pitcher wWPA  wWE  
   <chr>    <chr> <chr> <chr> <chr>      <chr> <chr>  <chr>  <chr>   <chr> <chr>
 1 "Top of… "Top… "Top… "Top… "Top of t… "Top… "Top … "Top … "Top o… Top … Top …
 2 "t1"     "0-0" "0"   "---" "4,(2-1) … "O"   "TEX"  "Marc… "Zac G… -2%   48%  
 3 "t1"     "0-0" "1"   "---" "5,(1-2) … "O"   "TEX"  "Core… "Zac G… -2%   46%  
 4 "t1"     "0-0" "2"   "---" "4,(1-2) … "O"   "TEX"  "Evan… "Zac G… -1%   45%  
 5 ""       ""    ""    ""    ""         ""    ""     ""     ""      0 ru… 0 ru…
 6 "Bottom… "Bot… "Bot… "Bot… "Bottom o… "Bot… "Bott… "Bott… "Botto… Bott… Bott…
 7 "b1"     "0-0" "0"   "---" "4,(3-0) … ""    "ARI"  "Corb… "Natha… -3%   42%  
 8 "b1"     "0-0" "0"   "1--" "1,(0-0) … ""    "ARI"  "Kete… "Natha… -2%   39%  
 9 "b1"     "0-0" "0"   "-2-" "3,(0-2) … "O"   "ARI"  "Kete… "Natha… 1%    41%  
10 "b1"     "0-0" "1"   "--3" "3,(1-1) … "O"   "ARI"  "Gabr… "Natha… 6%    46%  
# ℹ 110 more rows
# ℹ 1 more variable: `Play Description` <chr>
# ℹ Use `print(n = ...)` to see more rows
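
A possible explanation for the missing table, offered as a sketch rather than a confirmed diagnosis: the code in the question ends each lapply step with `.[[1]]`, which keeps only the first table found in each comment. If the page wraps both pitching tables in a single comment (an assumption about the page's markup), the second one is dropped. The code above avoids this by calling html_table() on the whole re-parsed comment and flattening the result. The snippet below reproduces the effect offline on a made-up HTML string; the ids TeamApitching and TeamBpitching and the pitcher names are hypothetical, not taken from the real page.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the real page: ONE comment that holds TWO tables.
html <- '<html><body>
<div id="all_pitching">
<!--
<div class="table_wrapper">
<table id="TeamApitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Pitcher A</td><td>6</td></tr>
</table>
<table id="TeamBpitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Pitcher B</td><td>5</td></tr>
</table>
</div>
-->
</div>
</body></html>'

page <- read_html(html)

# Keep only the comment nodes whose text contains table markup.
comments <- xml_find_all(page, "//comment()")
comments <- comments[grepl("<table", xml_text(comments), fixed = TRUE)]

# xml_text() returns the comment body without the <!-- --> delimiters,
# so it can be parsed again as ordinary HTML.
inner  <- read_html(xml_text(comments[[1]]))
tables <- xml_find_all(inner, "//table")

# What the question's `.[[1]]` keeps: only the first table in the comment.
first_only <- html_table(tables)[[1]]

# Keeping the whole list preserves every table in the comment.
all_in_comment <- html_table(tables)

# Naming the list by each table's id attribute makes lookups explicit.
names(all_in_comment) <- xml_attr(tables, "id")
```

With the list named this way, a table can be picked by id, for example `all_in_comment[["TeamBpitching"]]`, instead of by position. On the real page the ids would have to be read from the markup first, for instance with `names(all_in_comment)`.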