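I'm a rookie at web scraping, so apologies for the basic question, but I've searched around and had trouble applying previous answers here. I'm trying to scrape several related URLs on fbref.com (a subset of Sports Reference), but I'm running into a problem with what I think is the correct use of lapply. I can pull one URL successfully, but not all of them at once.
Here's the gist of what I'm trying to do: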
library("rvest")
library("tidyverse")
year1 <- paste0(2006:2021)
year2 <- paste0(2007:2022)
urls <- sort(rep(paste0("https://fbref.com/en/comps/Big5/", year1, "-", year2, "/stats/players/", year1, "-", year2, "-Big-5-European-Leagues-Stats")))
table <- read_html(urls) |>
html_nodes("table") |>
html_table()
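I think I just need to loop the last part with lapply, but I'm struggling to get the format right. When I use that last part to read a single URL by pasting it in directly, as below, I get the output I want. I'd just like all the seasons from 2006-07 through 2021-22 in one csv file.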
> url <- "https://fbref.com/en/comps/Big5/2021-2022/stats/players/2021-2022-Big-5-European-Leagues-Stats"
> table <- read_html(url) |>
+ html_nodes("table") |>
+ html_table()
> write.csv(table, file = "fbrefinitial.csv")
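From there, I think I just need to use bind_rows along with year1 or year2 to add a column for each year, since I want everything in a single tab of one csv file. (What is the proper way to format that command?)
This is most similar to this post, but my attempts to apply that logic in various ways haven't worked.
Thanks for your help!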
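read_html() takes a single URL rather than a vector of them, so apply it to each element of urls. You could do it like this: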
lapply(urls, function(url) {
read_html(url) |>
html_nodes("table") |>
html_table()
})
#> [[1]]
#> [[1]][[1]]
#> # A tibble: 2,687 x 29
#> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
#> 2 1 Dani Aba~ es E~ FW,MF Celt~ es L~ 18 1987 1 0 13 0.1
#> 3 2 Jacques ~ fr F~ DF Nice fr L~ 28 1978 30 28 2,492 27.7
#> 4 3 Christia~ it I~ GK Tori~ it S~ 29 1977 36 36 3,235 35.9
#> 5 4 Pato Abb~ ar A~ GK Geta~ es L~ 33 1972 36 36 3,215 35.7
#> 6 5 Elvis Ab~ it I~ FW Tori~ it S~ 25 1981 29 15 1,432 15.9
#> 7 6 Nadjim A~ km C~ MF Sedan fr L~ 22 1984 17 11 1,136 12.6
#> 8 7 Nelson A~ uy U~ MF Atal~ it S~ 33 1973 5 2 121 1.3
#> 9 8 Mathias ~ de G~ DF Hamb~ de B~ 25 1981 8 4 416 4.6
#> 10 9 Éric Abi~ fr F~ DF Lyon fr L~ 26 1979 33 31 2,750 30.6
#> # ... with 2,677 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> # 3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#>
#>
#> [[2]]
#> [[2]][[1]]
#> # A tibble: 2,770 x 29
#> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
#> 2 1 Jacques ~ fr F~ DF Nice fr L~ 29 1978 10 4 434 4.8
#> 3 2 Jacques ~ fr F~ DF Nürn~ de B~ 29 1978 10 9 820 9.1
#> 4 3 Ignazio ~ it I~ DF,MF Empo~ it S~ 20 1986 24 9 1,167 13.0
#> 5 4 Christia~ it I~ GK Atlé~ es L~ 30 1977 21 20 1,804 20.0
#> 6 5 Pato Abb~ ar A~ GK Geta~ es L~ 34 1972 34 34 3,046 33.8
#> 7 6 Yacine A~ ma M~ MF Stra~ fr L~ 26 1981 23 17 1,549 17.2
#> 8 7 Damià Ab~ es E~ DF,MF Betis es L~ 25 1982 26 24 2,230 24.8
#> 9 8 Éric Abi~ fr F~ DF Barc~ es L~ 27 1979 30 28 2,523 28.0
#> 10 9 Ahmed Ab~ eg E~ DF,MF Stra~ fr L~ 26 1981 2 1 91 1.0
#> # ... with 2,760 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> # 3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#>
#>
#> [[3]]
#> [[3]][[1]]
#> # A tibble: 2,796 x 29
#> `` `` `` `` `` `` `` `` Playi~1 Playi~2 Playi~3 Playi~4
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Rk Player Nati~ Pos Squad Comp Age Born MP Starts Min 90s
#> 2 1 Jacques ~ fr F~ DF Vale~ fr L~ 30 1978 18 14 1,252 13.9
#> 3 2 Ignazio ~ it I~ DF,MF Tori~ it S~ 21 1986 25 21 1,913 21.3
#> 4 3 Christia~ it I~ GK Milan it S~ 31 1977 28 28 2,441 27.1
#> 5 4 Pato Abb~ ar A~ GK Geta~ es L~ 35 1972 13 13 1,083 12.0
#> 6 5 Elvis Ab~ it I~ FW Tori~ it S~ 27 1981 10 2 388 4.3
#> 7 6 Djamel A~ dz A~ MF Nant~ fr L~ 22 1986 22 12 1,139 12.7
#> 8 7 Damià Ab~ es E~ DF,MF Betis es L~ 26 1982 25 20 1,788 19.9
#> 9 8 Éric Abi~ fr F~ DF Barc~ es L~ 28 1979 25 25 2,116 23.5
#> 10 9 Fabrice ~ fr F~ MF Lori~ fr L~ 29 1979 35 35 3,060 34.0
#> # ... with 2,786 more rows, 17 more variables: Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Performance <chr>, Performance <chr>,
#> # Performance <chr>, Performance <chr>, Progression <chr>, Progression <chr>,
#> # Progression <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>,
#> # `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `Per 90 Minutes` <chr>, `` <chr>,
#> # and abbreviated variable names 1: `Playing Time`, 2: `Playing Time`,
#> # 3: `Playing Time`, 4: `Playing Time`
#> # i Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#>
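
The question also asks how to get every season into one csv with a season column. Here is a minimal sketch of that step, not a definitive solution. It assumes each scraped table carries its real header ("Rk", "Player", ...) in the first data row, as the output above suggests, and that header rows may repeat further down the table (an assumption, not verified here). promote_header, the toy data frames, and the output file name are all hypothetical and only there for illustration.

```r
library(dplyr)

# Hypothetical helper: use the first data row as the column names and
# drop any later rows that merely repeat that header.
promote_header <- function(df) {
  header <- as.character(unlist(df[1, ]))
  df <- df[-1, , drop = FALSE]
  names(df) <- make.unique(header)  # the real tables repeat some column names
  df[!(df[[1]] %in% header[1]), , drop = FALSE]
}

# Toy stand-ins for two scraped season tables (made-up values, no network).
raw_2006 <- data.frame(X1 = c("Rk", "1", "2"),
                       X2 = c("Player", "A", "B"))
raw_2007 <- data.frame(X1 = c("Rk", "1", "Rk", "2"),
                       X2 = c("Player", "C", "Player", "D"))

# Name each list element by its season; bind_rows() turns the names
# into the column given by .id.
tables <- list("2006-2007" = raw_2006, "2007-2008" = raw_2007)

all_seasons <- bind_rows(lapply(tables, promote_header), .id = "season")

# Hypothetical output file name.
write.csv(all_seasons, "fbref_all_seasons.csv", row.names = FALSE)
```

With the real scrape, the lapply() call above returns a list of lists, so you would probably take the first table from each element (for example with `[[1]]`) and name the resulting list with `paste0(year1, "-", year2)` before passing it to bind_rows().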