I want to scrape the table of market values for each competition from this page:
https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1
The following code does work:
library(rvest)
library(dplyr)
tm <- read_html("https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1")
tbls <- html_nodes(tm, "table")
mydf <- html_table(tbls[1])
mydf <- as.data.frame(mydf)
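However, I have two problems.
The country names do not appear in mydf. (If I copy the table manually from the website and paste it into Excel, the flags are converted to country names.)
I cannot import pages 2 to 17; the table spans 17 pages. (The 1 at the end of the URL has nothing to do with page 1.)
If anyone knows how to solve this, many thanks!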
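If you inspect the pagination links, you will find the parameter responsible for the page number. Another option is to disable JavaScript for the site, which you may want to do anyway to get a better idea of how rvest "sees" the content. You will then see the real URL in the address bar.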
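As a rough sketch of what inspecting the pagination links looks like from R: the fragment below is hand-written for illustration (the class name mirrors the selector used further down in this answer, while the href values and the page count of 17 are assumptions, not a copy of the live markup). Listing the href attributes shows that the links differ only in the value after page=.

```r
library(rvest)
library(stringr)

# Hypothetical pagination fragment, written by hand for illustration only
pagination <- minimal_html('
  <ul>
    <li class="tm-pagination__list-item">
      <a href="/wettbewerbe/europa/wettbewerbe?plus=1&amp;page=2">2</a>
    </li>
    <li class="tm-pagination__list-item tm-pagination__list-item--icon-last-page">
      <a href="/wettbewerbe/europa/wettbewerbe?plus=1&amp;page=17">Last</a>
    </li>
  </ul>')

# collect the link targets of all pagination items
hrefs <- pagination |>
  html_elements("li a") |>
  html_attr("href")

# the digits that follow "page=" are the only part that changes between links
page_params <- str_extract(hrefs, "(?<=page=)\\d+")
```

Under these assumptions, appending &page=N to the original URL should select page N, which is what the full solution below relies on.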
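html_table() struggles with nested tables and with values stored in attributes (the competitions and countries). To work around this, we could drop html_table() and iterate over the <tr> elements to build each tibble / data.frame row manually.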
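A minimal sketch of that row-by-row approach, assuming a hand-written fragment modelled on the row structure shown later in this answer (only three columns, with made-up link targets). The helper name parse_row is hypothetical, and the live page may need extra handling for group rows.

```r
library(rvest)
library(purrr)
library(tibble)

# Hand-written fragment for illustration; not a copy of the live page
html <- minimal_html('
  <table class="items"><tbody>
    <tr>
      <td><table class="inline-table"><tr>
        <td><a href="#"><img title="Premier League"></a></td>
        <td><a href="#">Premier League</a></td>
      </tr></table></td>
      <td><img title="England"></td>
      <td>20</td>
    </tr>
    <tr>
      <td><table class="inline-table"><tr>
        <td><a href="#"><img title="LaLiga"></a></td>
        <td><a href="#">LaLiga</a></td>
      </tr></table></td>
      <td><img title="Spain"></td>
      <td>20</td>
    </tr>
  </tbody></table>')

# outer rows only: the child combinators skip rows of the nested inline tables
rows <- html_elements(html, "table.items > tbody > tr")

# hypothetical helper: build a one-row tibble from a single <tr>
parse_row <- function(tr) {
  cells <- html_elements(tr, xpath = "./td")  # direct child cells only
  tibble(
    Competition = cells[[1]] |> html_element("img") |> html_attr("title"),
    Country     = cells[[2]] |> html_element("img") |> html_attr("title"),
    Clubs       = cells[[3]] |> html_text2() |> as.integer()
  )
}

result <- rows |> map(parse_row) |> list_rbind()
```

This gives full control over each cell, at the cost of spelling out every column by hand.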
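Or we could first manipulate the XML tree so that the relevant <td> elements contain only the required data, i.e. replace the inline tables and links / images with simple text that html_table() can handle: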
library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
# replace nested <td> nodes with <td>image title</td> in XML tree
# xpath_img - xpath to image with title attribute
# xpath_td - xpath to <td> nodes that will be replaced by <td>image title</td>
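image_title_to_td <- function(table_elem, xpath_img, xpath_td) {
  # build one replacement <td> node per image, holding the image title as text
  td_nodes <-
    table_elem %>%
    html_elements(xpath = xpath_img) %>%
    html_attr("title") %>%
    { str_glue("<td>{.}</td>") } %>%
    map(read_xml)
  # swap the matched <td> nodes for the replacements (modifies the tree in place)
  table_elem %>%
    html_elements(xpath = xpath_td) %>%
    xml_replace(td_nodes)
  # return the modified table element so that calls can be piped
  table_elem
}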
parse_table <- function(html) {
  html %>%
    html_element("table.items") %>%
    # handle competitions -- pull image title from inline table
    image_title_to_td("//td/table/tr/td/a/img", "//td[table]") %>%
    # handle countries -- pull image title from 2nd <td>
    image_title_to_td("./tbody/tr/td[2]/img", "./tbody/tr/td[2]") %>%
    html_table()
}
url_ <- "https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1"
first_page <- read_html(url_)
# extract last page number (a number after "page=" in url)
last_page_n <-
  first_page |>
  html_element("li.tm-pagination__list-item--icon-last-page a") |>
  html_attr("href") |>
  str_extract("(?<=page\\=)\\d+") |>
  as.numeric()
# generate URLs for pages 2 ... last_page,
# read all pages,
# insert previously fetched first page into the list,
# extract table from each fetched page,
# combine list of tables into a single tibble,
# deal with group rows (First Tier, ..., National Youth Super Cup)
paste0(url_, "&page=", 2:last_page_n) %>%
  map(read_html, .progress = TRUE) %>%
  append(list(first_page), after = 0) %>%
  map(parse_table) %>%
  list_rbind() %>%
  mutate(grp = Forum, grp = na_if(grp, ""), .before = 1, .keep = "unused") %>%
  fill(grp) %>%
  filter(str_detect(Clubs, "^\\d+$"))
Resulting dataset:
#> # A tibble: 425 × 11
#> grp Competition Country Clubs Player `Avg. age` Foreigners
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 First Tier Premier League England 20 549 26.3 68.9 %
#> 2 First Tier LaLiga Spain 20 487 27.4 41.1 %
#> 3 First Tier Serie A Italy 20 565 26.1 63.4 %
#> 4 First Tier Bundesliga Germany 18 514 25.7 48.2 %
#> 5 First Tier Ligue 1 France 18 479 25.2 57.0 %
#> 6 First Tier Liga Portugal Portugal 18 511 25.5 58.3 %
#> 7 First Tier Süper Lig Turkey 20 605 26.5 49.4 %
#> 8 First Tier Eredivisie Netherlands 18 471 24.4 47.6 %
#> 9 First Tier Jupiler Pro League Belgium 16 447 24.6 57.9 %
#> 10 First Tier Premier Liga Russia 16 422 26.1 35.8 %
#> # ℹ 415 more rows
#> # ℹ 4 more variables: `Game ratio of foreign players` <chr>,
#> # `Goals per match` <chr>, `Average market value` <chr>, `Total value` <chr>
To illustrate the XML manipulation, here is the first row of the table in the original HTML:
first_page <- read_html(url_)
first_page %>%
  html_elements("table.items tbody tr") %>%
  pluck(2) %>%
  as.character() %>%
  cat()
#> <tr class="odd">
#> <td class="hauptlink"><table class="inline-table"><tr>
#> <td>
#> <a href="/premier-league/startseite/wettbewerb/GB1">
#> <img src="https://tmssl.akamaized.net/images/logo/tiny/gb1.png?lm=1521104656" title="Premier League" alt="Premier League" class="continental-league-emblem"></a>
#> </td>
#> <td>
#> <a href="/premier-league/startseite/wettbewerb/GB1" title="Premier League">Premier League</a>
#> </td>
#> </tr></table></td>
#> <td class="zentriert"><img src="https://tmssl.akamaized.net/images/flagge/tiny/189.png?lm=1520611569" title="England" alt="England" class="flaggenrahmen"></td>
#> <td class="zentriert">20</td>
#> <td class="zentriert">549</td>
#> <td class="zentriert">26.4</td>
#> <td class="zentriert"><a href="/premier-league/gastarbeiter/wettbewerb/GB1">68.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/legionaereeinsaetze/wettbewerb/GB1">71.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/gesamtspielplan/wettbewerb/GB1">3.08</a></td>
#> <td class="zentriert"></td>
#> <td class="rechts">€520.33m</td>
#> <td class="rechts hauptlink">€10.41bn</td>
#> </tr>
After manipulating the XML tree, note the changes in the first 2 <td> elements:
first_page %>%
  html_element("table.items") %>%
  # change 1st <td> of every row, replace inline table
  image_title_to_td("//td/table/tr/td/a/img", "//td[table]") %>%
  # change 2nd <td> of every row, replace image
  image_title_to_td("./tbody/tr/td[2]/img", "./tbody/tr/td[2]") %>%
  html_elements("tbody tr") %>%
  pluck(2) %>%
  as.character() %>%
  cat()
#> <tr class="odd">
#> <td>Premier League</td>
#> <td>England</td>
#> <td class="zentriert">20</td>
#> <td class="zentriert">549</td>
#> <td class="zentriert">26.4</td>
#> <td class="zentriert"><a href="/premier-league/gastarbeiter/wettbewerb/GB1">68.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/legionaereeinsaetze/wettbewerb/GB1">71.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/gesamtspielplan/wettbewerb/GB1">3.08</a></td>
#> <td class="zentriert"></td>
#> <td class="rechts">€520.33m</td>
#> <td class="rechts hauptlink">€10.41bn</td>
#> </tr>
Created on 2023-09-19 with reprex v2.0.2