rvest web scraping of multiple URLs (hopefully a simple question)


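I'm a novice at web scraping, so apologies for the basic question, but I've searched around and had trouble applying earlier answers here. I'm trying to scrape several related URLs on fbref.com (part of Sports Reference), but I've run into a problem with what I thought was the correct use of lapply. I can pull one URL successfully, but not all of them at once.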

Here's the gist of what I'm trying to do:

library("rvest")
library("tidyverse")

year1 <- paste0(2006:2021)
year2 <- paste0(2007:2022)

urls <- sort(rep(paste0("https://fbref.com/en/comps/Big5/", year1, "-", year2, "/stats/players/", year1, "-", year2, "-Big-5-European-Leagues-Stats")))

table <- read_html(urls) |> 
  html_nodes("table") |> 
  html_table()

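I think I just need to loop that last part with lapply, but I'm struggling to get the format right. When I use that last part to read a single URL by pasting it in directly, as shown below, I get the output I want. I'd just like every season from 2006-07 to 2021-22 in a single CSV file.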

> url <- "https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"
> table <- read_html(url) |> 
+     html_nodes("table") |> 
+     html_table()
> write.csv(table, file = "fbrefinitial.csv")

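From there, I assume I just need to use bind_rows together with year1 or year2 to add a column for each year, since I want everything in one sheet of a single CSV file. (What is the correct way to format that command?)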

This is most similar to this post, but my attempts to apply that logic in different ways haven't worked.

Thanks for your help!

r web-scraping lapply rvest
1 Answer

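read_html() accepts only one URL at a time, so wrap the pipeline in a function and apply it to each element of urls. You can do it like this: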

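# Read each URL in turn; the result is a list with one element per season,
# each element itself a list of the tables found on that page.
lapply(urls, function(url) {
  read_html(url) |> 
    html_nodes("table") |> 
    html_table()
})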
#> [[1]]
#> [[1]][[1]]
#> # A tibble: 2,687 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Dani Aba~ es E~ FW,MF Celt~ es L~ 18    1987  1       0       13      0.1    
#>  3 2     Jacques ~ fr F~ DF    Nice  fr L~ 28    1978  30      28      2,492   27.7   
#>  4 3     Christia~ it I~ GK    Tori~ it S~ 29    1977  36      36      3,235   35.9   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 33    1972  36      36      3,215   35.7   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 25    1981  29      15      1,432   15.9   
#>  7 6     Nadjim A~ km C~ MF    Sedan fr L~ 22    1984  17      11      1,136   12.6   
#>  8 7     Nelson A~ uy U~ MF    Atal~ it S~ 33    1973  5       2       121     1.3    
#>  9 8     Mathias ~ de G~ DF    Hamb~ de B~ 25    1981  8       4       416     4.6    
#> 10 9     Éric Abi~ fr F~ DF    Lyon  fr L~ 26    1979  33      31      2,750   30.6   
#> # ... with 2,677 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[2]]
#> [[2]][[1]]
#> # A tibble: 2,770 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Nice  fr L~ 29    1978  10      4       434     4.8    
#>  3 2     Jacques ~ fr F~ DF    Nürn~ de B~ 29    1978  10      9       820     9.1    
#>  4 3     Ignazio ~ it I~ DF,MF Empo~ it S~ 20    1986  24      9       1,167   13.0   
#>  5 4     Christia~ it I~ GK    Atlé~ es L~ 30    1977  21      20      1,804   20.0   
#>  6 5     Pato Abb~ ar A~ GK    Geta~ es L~ 34    1972  34      34      3,046   33.8   
#>  7 6     Yacine A~ ma M~ MF    Stra~ fr L~ 26    1981  23      17      1,549   17.2   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 25    1982  26      24      2,230   24.8   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 27    1979  30      28      2,523   28.0   
#> 10 9     Ahmed Ab~ eg E~ DF,MF Stra~ fr L~ 26    1981  2       1       91      1.0    
#> # ... with 2,760 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> 
#> [[3]]
#> [[3]][[1]]
#> # A tibble: 2,796 x 29
#>    ``    ``        ``    ``    ``    ``    ``    ``    Playi~1 Playi~2 Playi~3 Playi~4
#>    <chr> <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>   <chr>   <chr>   <chr>  
#>  1 Rk    Player    Nati~ Pos   Squad Comp  Age   Born  MP      Starts  Min     90s    
#>  2 1     Jacques ~ fr F~ DF    Vale~ fr L~ 30    1978  18      14      1,252   13.9   
#>  3 2     Ignazio ~ it I~ DF,MF Tori~ it S~ 21    1986  25      21      1,913   21.3   
#>  4 3     Christia~ it I~ GK    Milan it S~ 31    1977  28      28      2,441   27.1   
#>  5 4     Pato Abb~ ar A~ GK    Geta~ es L~ 35    1972  13      13      1,083   12.0   
#>  6 5     Elvis Ab~ it I~ FW    Tori~ it S~ 27    1981  10      2       388     4.3    
#>  7 6     Djamel A~ dz A~ MF    Nant~ fr L~ 22    1986  22      12      1,139   12.7   
#>  8 7     Damià Ab~ es E~ DF,MF Betis es L~ 26    1982  25      20      1,788   19.9   
#>  9 8     Éric Abi~ fr F~ DF    Barc~ es L~ 28    1979  25      25      2,116   23.5   
#> 10 9     Fabrice ~ fr F~ MF    Lori~ fr L~ 29    1979  35      35      3,060   34.0   
#> # ... with 2,786 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> #   Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> #   Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> #   `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> #   and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> #   3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
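
The question also asks how to get every season into one CSV with a season column. Below is a minimal sketch of one way to do that, not part of the original answer. It assumes each page's first table has the layout shown in the output above: group headers as column names and the real header (Rk, Player, ...) as the first data row. The helper name `clean_season`, the `Season` column, and the two tiny inline HTML pages (built with `minimal_html()` so the example runs without network access, with made-up player rows) are all hypothetical.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: promote the first data row (Rk, Player, ...) to the
# column names, drop header rows, and tag every remaining row with its season.
clean_season <- function(tbl, season) {
  header <- vapply(tbl, function(col) as.character(col[1]), character(1),
                   USE.NAMES = FALSE)
  header[is.na(header) | header == ""] <- "unnamed"
  tbl <- as.data.frame(tbl)
  names(tbl) <- make.unique(header)  # repeated stat names get .1, .2, ...
  # Drop row 1 (the header) and any later row that repeats the header.
  keep <- seq_len(nrow(tbl)) > 1 & !(tbl[[1]] %in% header[1])
  tbl <- tbl[keep, , drop = FALSE]
  tbl$Season <- rep(season, nrow(tbl))
  tbl
}

# Toy stand-ins for two season pages (made-up rows, same two-row header shape).
page_a <- minimal_html('
  <table>
    <tr><th></th><th></th><th>Playing Time</th></tr>
    <tr><td>Rk</td><td>Player</td><td>MP</td></tr>
    <tr><td>1</td><td>Player A</td><td>30</td></tr>
  </table>')
page_b <- minimal_html('
  <table>
    <tr><th></th><th></th><th>Playing Time</th></tr>
    <tr><td>Rk</td><td>Player</td><td>MP</td></tr>
    <tr><td>1</td><td>Player B</td><td>12</td></tr>
    <tr><td>Rk</td><td>Player</td><td>MP</td></tr>
  </table>')

pages   <- list(page_a, page_b)
seasons <- c("2006-2007", "2007-2008")

# Take the first table on each page, clean it, then stack all seasons.
tables <- Map(function(page, season) {
  page |>
    html_element("table") |>
    html_table() |>
    clean_season(season)
}, pages, seasons)

all_seasons <- bind_rows(tables)
print(all_seasons)
```

With the real site, `page` would be `read_html(url)` for each element of `urls`, and `seasons` could be `paste0(year1, "-", year2)`. Pausing between requests with `Sys.sleep()` is a sensible courtesy to the server. The combined result can then be saved with `write.csv(all_seasons, "fbref_all_seasons.csv", row.names = FALSE)` (the file name is only an example). Every column stays character because the header row was mixed into the data, so numeric columns would still need converting afterwards.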
