R: screen-scraping drop-down values


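The page https://www.covers.com/sport/basketball/nba/matchup/290850/props is dynamic in two ways:

  1. It depends on the matchup selected (the ####### in #######/props differs from game to game).
  2. The page has a drop-down selector called "props"; changing it updates the page to show the information associated with the selection.

A) I am struggling to use rvest to load each row of information (below) into a table: 1) name 2) prop value 3) projected value 4) best odds 5) analysis

B) In R, how can I change the "props" selector so that I can download the same data as in A for each selection?

I have been banging my head against the monitor: I can get down to the "node level", but I am struggling to parse the information at the lowest level.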

library(rvest)

covers_page <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"

tmp <- read_html(covers_page)

nodes_1 <- tmp %>% html_elements("div") %>% xml2::xml_find_all("//div[contains(@class,'player-props-table-container')]")

r web-scraping rvest
1 Answer

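B: If you inspect the HTTP requests in the Network tab of the browser's developer tools, you should notice that the "props" drop-down triggers Ajax calls (e.g. ... /290850/market?propEvent=NBA_GAME_PLAYER_POINTS) that fetch the table content for each prop. Because rvest cannot run JavaScript, the URLs for these calls have to be built from the drop-down item values. So first we need the list of valid propEvent values: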

library(rvest)
library(dplyr)
library(purrr)
library(stringr)

url_ <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"
prop_events <- 
  read_html(url_) |>
  html_elements("li[data-event-name]") |>
  map(\(elem) list(event = html_attr(elem, "data-event-name"),
                   descr = html_text(elem))) |>
  bind_rows()
prop_events
#> # A tibble: 12 × 2
#>    event                                   descr                              
#>    <chr>                                   <chr>                              
#>  1 NBA_GAME_PLAYER_POINTS                  Points Scored                      
#>  2 NBA_GAME_PLAYER_POINTS_REBOUNDS         Points and Rebounds                
#>  3 NBA_GAME_PLAYER_POINTS_ASSISTS          Points and Assists                 
#>  4 NBA_GAME_PLAYER_3_POINTERS_MADE         3-Pointers Made                    
#>  5 NBA_GAME_PLAYER_REBOUNDS_ASSISTS        Rebounds and Assists               
#>  6 NBA_GAME_PLAYER_STEALS_BLOCKS           Steals and Blocks                  
#>  7 NBA_GAME_PLAYER_BLOCKS                  Total Blocks                       
#>  8 NBA_GAME_PLAYER_STEALS                  Total Steals                       
#>  9 NBA_GAME_PLAYER_REBOUNDS                Total Rebounds                     
#> 10 NBA_GAME_PLAYER_POINTS_REBOUNDS_ASSISTS Total Points, Rebounds, and Assists
#> 11 NBA_GAME_PLAYER_TURNOVERS               Total Turnovers                    
#> 12 NBA_GAME_PLAYER_ASSISTS                 Total Assists

# url for props Ajax calls
(url_market <- str_replace(url_, "props$", "market?propEvent="))
#> [1] "https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent="
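
The same URL-building step can be tried offline. Below is a minimal sketch, assuming the drop-down items look like the li[data-event-name] elements selected above. The inline markup and the helper market_urls() are hypothetical, and it is only an assumption that other matchup IDs follow the same URL pattern as 290850.

```r
library(rvest)

# Hypothetical stand-in for the drop-down markup; the real page has more items
snippet <- minimal_html('
  <ul>
    <li data-event-name="NBA_GAME_PLAYER_POINTS">Points Scored</li>
    <li data-event-name="NBA_GAME_PLAYER_ASSISTS">Total Assists</li>
  </ul>')

# html_attr() is vectorised over the node set, so this returns a character vector
events <- snippet |>
  html_elements("li[data-event-name]") |>
  html_attr("data-event-name")

# Hypothetical helper: one Ajax URL per propEvent value for a given matchup ID
# (assumes the URL pattern seen for 290850 also holds for other games)
market_urls <- function(matchup_id, events) {
  sprintf("https://www.covers.com/sport/basketball/nba/matchup/%s/market?propEvent=%s",
          matchup_id, events)
}

urls <- market_urls("290850", events)
urls
```

Passing the matchup ID as a character string avoids any surprises from numeric formatting in sprintf().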

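A: You usually want CSS selectors that are more specific than a plain div. Elements returned by html_element() / html_elements() can be passed to the next html_element() / html_elements() call, which means you can first select all the articles (article.player-prop-article), then iterate over that list of elements and extract the bits of interest from each article.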

# fetch content and process rows (player-prop-article), return tibble
parse_prop <- function(event_url){
  read_html(event_url) |>
  html_elements("article.player-prop-article") |>
  map(\(art) list(
    name = html_element(art, ".player-headshot-name strong") |> html_text(),
    prop = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
    proj = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text(),
    odds = html_element(art, ".player-bestOdds-row > a > div > span") |> html_text(),
    art  = html_element(art, ".player-analysis") |> html_text())) |>
  bind_rows()
}

# call parse_prop() on the first three propEvents and row-bind the results
props <- 
  prop_events$event[1:3] |>
  set_names() |>
  map(\(event) str_c(url_market, event)) |>
  map(parse_prop, .progress = TRUE) |>
  list_rbind(names_to = "prop_event")
props
#> # A tibble: 39 × 6
#>    prop_event             name               prop  proj  odds  art              
#>    <chr>                  <chr>              <chr> <chr> <chr> <chr>            
#>  1 NBA_GAME_PLAYER_POINTS Ja Morant          25.5  22.4  -120  Offensive reboun…
#>  2 NBA_GAME_PLAYER_POINTS Jaren Jackson Jr.  18.5  21.5  -125  Jaren Jackson Jr…
#>  3 NBA_GAME_PLAYER_POINTS Jonas Valanciunas  15.5  14.2  -114  Out of all playe…
#>  4 NBA_GAME_PLAYER_POINTS Vince Williams Jr. 6.5   8.2   -150  Vince Williams J…
#>  5 NBA_GAME_PLAYER_POINTS CJ McCollum        17.5  19.4  -125  CJ McCollum has …
#>  6 NBA_GAME_PLAYER_POINTS Santi Aldama       8.5   9.5   -110  The Memphis Griz…
#>  7 NBA_GAME_PLAYER_POINTS Herbert Jones      9.5   10.2  -110  Herbert Jones ha…
#>  8 NBA_GAME_PLAYER_POINTS Trey Murphy III    12.5  13.5  -130  Among all player…
#>  9 NBA_GAME_PLAYER_POINTS Bismack Biyombo    6.5   6     -140  Bismack Biyombo …
#> 10 NBA_GAME_PLAYER_POINTS David Roddy        7.5   7.9   -106  David Roddy has …
#> # ℹ 29 more rows

A perhaps more common approach is to extract column vectors from the document / parent element and combine them into a data.frame / tibble, like this:

html <- read_html("https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent=NBA_GAME_PLAYER_POINTS")
tibble(
  name = html_elements(html, ".player-headshot-name strong") |> html_text(),
  prop = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
  proj = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text()
)

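While this also tends to be faster than iterating over elements, it is less robust, because it only works when the extracted vectors are guaranteed to have the same length. html_elements() simply skips an article that lacks a matching element, whereas html_element() applied to each article returns NA for it.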
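
To illustrate the length problem, here is a small offline sketch. The markup and the class names .name and .proj are made up for the example and are not taken from the covers.com page.

```r
library(rvest)

# Hypothetical markup: the second article has no projection element
page <- minimal_html('
  <article class="player-prop-article">
    <strong class="name">Player A</strong><span class="proj">22.4</span>
  </article>
  <article class="player-prop-article">
    <strong class="name">Player B</strong>
  </article>')

# Document-wide extraction: the missing element is skipped,
# so the two column vectors no longer line up
names_all <- page |> html_elements(".name") |> html_text()
proj_all  <- page |> html_elements(".proj") |> html_text()
length(names_all)  # 2
length(proj_all)   # 1

# Per-article extraction: html_element() gives one result per article,
# with NA where nothing matched
articles <- html_elements(page, "article.player-prop-article")
proj_by_article <- articles |> html_element(".proj") |> html_text()
proj_by_article
```

With the per-article version the NA keeps each value attached to its own row, which is why the iteration in parse_prop() tolerates incomplete articles.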
