How to scrape the next pages of a web table using R?


I want to scrape the table of market values for each competition from this web page:

https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1

I did succeed with the code below:

library(rvest)
library(dplyr)

tm <- read_html("https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1")

tbls <- html_nodes(tm, "table")

mydf <- html_table(tbls[1])
mydf <- as.data.frame(mydf)

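I have two problems, though.

  1. The country names do not appear in mydf. (If I manually copy the table from the website and paste it into Excel, the flags are converted to country names.)

  2. I cannot import pages 2 to 17; the table spans 17 pages. (The 1 at the end of the URL has nothing to do with page 1.)

If anyone knows how to solve this, thanks a lot!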

r web-scraping rvest
1 Answer

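If you inspect the pagination links, you will find the parameter that controls the page number. Another option is to disable JavaScript for the site, which you may want to do anyway to get a better idea of how rvest "sees" the content. You will then see the real URL in the address bar.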
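
As a minimal sketch of the first point, the snippet below reads the page parameter from a "last page" link and builds one URL per page. The pagination markup and its href value are assumptions modelled on the URL in the question (17 pages), not a copy of the live site, and minimal_html() stands in for read_html() so that the example runs offline.

```r
library(rvest)
library(stringr)

# Hypothetical pagination markup, reduced to the "last page" link.
# The href value is an assumption based on the URL in the question.
pagination <- minimal_html('
  <ul>
    <li class="tm-pagination__list-item--icon-last-page">
      <a href="/wettbewerbe/europa/wettbewerbe?plus=1&amp;page=17">Last</a>
    </li>
  </ul>')

last_href <- pagination |>
  html_element("li.tm-pagination__list-item--icon-last-page a") |>
  html_attr("href")

# the digits that follow "page=" give the number of the last page
last_page_n <- as.integer(str_extract(last_href, "(?<=page=)\\d+"))

# page 1 is the base URL itself; later pages get "&page=<n>" appended
base_url <- "https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1"
page_urls <- c(base_url, paste0(base_url, "&page=", 2:last_page_n))
```

Against the real site, pagination would be replaced by the result of read_html() on the first page.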

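html_table() struggles with nested tables and attribute values (competitions and countries). To work around this, we can drop html_table() and iterate over the <tr> elements to build each tibble / data.frame row by hand.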
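
A rough sketch of that row-by-row approach follows. The toy table is an assumption that only mimics the structure described here (competition name inside an inline table, country as an image title); the column names competition, country and clubs are made up for illustration.

```r
library(rvest)
library(dplyr)

# Toy table (hypothetical, not the real page): the 1st <td> holds an
# inline table, the 2nd <td> holds a flag image, the 3rd is plain text.
page <- minimal_html('
  <table class="items"><tbody>
    <tr>
      <td><table class="inline-table"><tr>
        <td><a href="#"><img title="Premier League"></a></td>
        <td><a href="#">Premier League</a></td>
      </tr></table></td>
      <td><img title="England"></td>
      <td>20</td>
    </tr>
    <tr>
      <td><table class="inline-table"><tr>
        <td><a href="#"><img title="LaLiga"></a></td>
        <td><a href="#">LaLiga</a></td>
      </tr></table></td>
      <td><img title="Spain"></td>
      <td>20</td>
    </tr>
  </tbody></table>')

# select only the outer rows; the child combinators skip inline-table rows
rows <- html_elements(page, "table.items > tbody > tr")

# build a one-row tibble per <tr>, then stack them
competitions <- lapply(rows, function(tr) {
  tibble(
    competition = tr |> html_element("table.inline-table img") |> html_attr("title"),
    country     = tr |> html_element(xpath = "./td[2]/img") |> html_attr("title"),
    clubs       = tr |> html_element(xpath = "./td[3]") |> html_text2() |> as.integer()
  )
}) |>
  bind_rows()
```

This gives full control over every cell, at the cost of spelling out each column by hand.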

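Alternatively, we can first manipulate the XML tree so that the relevant <td> elements contain only the desired data, i.e. replace the inline tables and links/images with plain text that html_table() can handle: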

library(rvest)
library(xml2)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)

# replace nested <td> nodes with <td>image title</td> in XML tree
# xpath_img - xpath to image with title attribute
# xpath_td  - xpath to <td> nodes that will be replaced by <td>image title</td>
image_title_to_td <- function(table_elem, xpath_img, xpath_td){
  td_nodes <- 
    table_elem %>% 
    html_elements(xpath = xpath_img) %>% 
    html_attr("title") %>% 
    { str_glue("<td>{.}</td>") } %>% 
    map(read_xml)
  
  table_elem %>% 
    html_elements(xpath = xpath_td) %>% 
    xml_replace(td_nodes)
  
  table_elem
}

parse_table <- function(html){
  html %>% 
    html_element( "table.items") %>% 
    # handle competitions -- pull image title from inline table
    image_title_to_td("//td/table/tr/td/a/img", "//td[table]") %>% 
    # handle countries -- pull image title from 2nd <td>
    image_title_to_td("./tbody/tr/td[2]/img", "./tbody/tr/td[2]") %>% 
    html_table()
}


url_ <- "https://www.transfermarkt.com/wettbewerbe/europa/wettbewerbe?plus=1"
first_page <- read_html(url_)

# extract last page number (a number after "page=" in url)
last_page_n <- 
  first_page |>
  html_element("li.tm-pagination__list-item--icon-last-page a") |>
  html_attr("href") |>
  str_extract("(?<=page\\=)\\d+") |>
  as.numeric()

# generate URLs for pages 2 ... last_page,
# read all pages,
# insert previously fetched first page into the list,
# extract table from each fetched page,
# combine list of tables into a single tibble,
# deal with group rows (First Tier, ..., National Youth Super Cup)
paste0(url_, "&page=", 2:last_page_n) %>% 
  map(read_html, .progress = TRUE) %>% 
  append(list(first_page), after = 0) %>% 
  map(parse_table) %>% 
  list_rbind() %>% 
  mutate(grp = Forum, grp = na_if(grp,""), .before = 1, .keep = "unused") %>% 
  fill(grp) %>% 
  filter(str_detect(Clubs, "^\\d+$"))

Resulting dataset:

#> # A tibble: 425 × 11
#>    grp        Competition        Country     Clubs Player `Avg. age` Foreigners
#>    <chr>      <chr>              <chr>       <chr> <chr>  <chr>      <chr>     
#>  1 First Tier Premier League     England     20    549    26.3       68.9 %    
#>  2 First Tier LaLiga             Spain       20    487    27.4       41.1 %    
#>  3 First Tier Serie A            Italy       20    565    26.1       63.4 %    
#>  4 First Tier Bundesliga         Germany     18    514    25.7       48.2 %    
#>  5 First Tier Ligue 1            France      18    479    25.2       57.0 %    
#>  6 First Tier Liga Portugal      Portugal    18    511    25.5       58.3 %    
#>  7 First Tier Süper Lig          Turkey      20    605    26.5       49.4 %    
#>  8 First Tier Eredivisie         Netherlands 18    471    24.4       47.6 %    
#>  9 First Tier Jupiler Pro League Belgium     16    447    24.6       57.9 %    
#> 10 First Tier Premier Liga       Russia      16    422    26.1       35.8 %    
#> # ℹ 415 more rows
#> # ℹ 4 more variables: `Game ratio of foreign players` <chr>,
#> #   `Goals per match` <chr>, `Average market value` <chr>, `Total value` <chr>
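
The group-row handling at the end of that pipeline can be tried in isolation on a toy tibble. The data below is an assumption about what html_table() returns for this page: a group row repeats its label in every column, while Forum is empty for regular rows.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data (hypothetical): rows 1 and 4 are group rows, the rest are leagues
raw <- tibble(
  Competition = c("First Tier", "Premier League", "LaLiga", "Second Tier", "Championship"),
  Clubs       = c("First Tier", "20", "20", "Second Tier", "24"),
  Forum       = c("First Tier", "", "", "Second Tier", "")
)

tidy <- raw %>%
  # turn the group label into its own column, NA on regular rows
  mutate(grp = na_if(Forum, ""), .before = 1) %>%
  select(-Forum) %>%
  # carry each group label down to the rows below it
  fill(grp) %>%
  # keep only real rows, i.e. those with a numeric club count
  filter(str_detect(Clubs, "^\\d+$"))
```

Each remaining row then carries the label of the group row that preceded it.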

To illustrate the XML manipulation, here is the first row of the table in the original HTML:

first_page <- read_html(url_)
first_page %>% 
  html_elements("table.items tbody tr") %>% 
  pluck(2) %>% 
  as.character() %>% 
  cat()
#> <tr class="odd">
#> <td class="hauptlink"><table class="inline-table"><tr>
#> <td>
#>             <a href="/premier-league/startseite/wettbewerb/GB1">
#>                 <img src="https://tmssl.akamaized.net/images/logo/tiny/gb1.png?lm=1521104656" title="Premier League" alt="Premier League" class="continental-league-emblem"></a>
#>         </td>
#>         <td>
#>             <a href="/premier-league/startseite/wettbewerb/GB1" title="Premier League">Premier League</a>
#>         </td>
#>     </tr></table></td>
#> <td class="zentriert"><img src="https://tmssl.akamaized.net/images/flagge/tiny/189.png?lm=1520611569" title="England" alt="England" class="flaggenrahmen"></td>
#> <td class="zentriert">20</td>
#> <td class="zentriert">549</td>
#> <td class="zentriert">26.4</td>
#> <td class="zentriert"><a href="/premier-league/gastarbeiter/wettbewerb/GB1">68.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/legionaereeinsaetze/wettbewerb/GB1">71.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/gesamtspielplan/wettbewerb/GB1">3.08</a></td>
#> <td class="zentriert"></td>
#> <td class="rechts">€520.33m</td>
#> <td class="rechts hauptlink">€10.41bn</td>
#> </tr>

After manipulating the XML tree, note the change in the first 2 <td> elements:

first_page %>% 
  html_element("table.items") %>% 
  # change 1st <td> of every row, replace inline table
  image_title_to_td("//td/table/tr/td/a/img", "//td[table]") %>% 
  # change 2nd <td> of every row, replace image
  image_title_to_td("./tbody/tr/td[2]/img", "./tbody/tr/td[2]") %>% 
  html_elements("tbody tr") %>% 
  pluck(2) %>% 
  as.character() %>% 
  cat()
#> <tr class="odd">
#> <td>Premier League</td>
#> <td>England</td>
#> <td class="zentriert">20</td>
#> <td class="zentriert">549</td>
#> <td class="zentriert">26.4</td>
#> <td class="zentriert"><a href="/premier-league/gastarbeiter/wettbewerb/GB1">68.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/legionaereeinsaetze/wettbewerb/GB1">71.9 %</a></td>
#> <td class="zentriert"><a href="/premier-league/gesamtspielplan/wettbewerb/GB1">3.08</a></td>
#> <td class="zentriert"></td>
#> <td class="rechts">€520.33m</td>
#> <td class="rechts hauptlink">€10.41bn</td>
#> </tr>

Created on 2023-09-19 with reprex v2.0.2
