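On this site, https://www.covers.com/sport/basketball/nba/matchup/290850/props, which is a dynamic page, I have two questions:
A) I am struggling to use rvest to load each row of information (below) into a table: 1) name 2) prop value 3) projected value 4) best odds 5) analysis
B) In R, how can I update the "props" selector so that I can download the same data as in A for the other props?
I have been banging my head against the monitor: I can get down to the "node level", but I am struggling with how to parse the lowest-level information.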
library(rvest)
library(xml2)
covers_page <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"
tmp <- read_html(covers_page)
nodes_1 <- tmp %>% html_elements("div") %>% xml_find_all("//div[contains(@class,'player-props-table-container')]")
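B: When inspecting HTTP requests in the Network tab of your browser's developer tools, you should notice that the "props" drop-down triggers Ajax calls (e.g. ... /290850/market?propEvent=NBA_GAME_PLAYER_POINTS ) that fetch the table content for each prop. Since rvest cannot run JavaScript, the URLs for those calls have to be built from the drop-down item values. So first we need the list of valid values for propEvent: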
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
url_ <- "https://www.covers.com/sport/basketball/nba/matchup/290850/props"
prop_events <-
  read_html(url_) |>
  html_elements("li[data-event-name]") |>
  map(\(elem) list(event = html_attr(elem, "data-event-name"),
                   descr = html_text(elem))) |>
  bind_rows()
prop_events
#> # A tibble: 12 × 2
#>    event                                   descr
#>    <chr>                                   <chr>
#>  1 NBA_GAME_PLAYER_POINTS                  Points Scored
#>  2 NBA_GAME_PLAYER_POINTS_REBOUNDS         Points and Rebounds
#>  3 NBA_GAME_PLAYER_POINTS_ASSISTS          Points and Assists
#>  4 NBA_GAME_PLAYER_3_POINTERS_MADE         3-Pointers Made
#>  5 NBA_GAME_PLAYER_REBOUNDS_ASSISTS        Rebounds and Assists
#>  6 NBA_GAME_PLAYER_STEALS_BLOCKS           Steals and Blocks
#>  7 NBA_GAME_PLAYER_BLOCKS                  Total Blocks
#>  8 NBA_GAME_PLAYER_STEALS                  Total Steals
#>  9 NBA_GAME_PLAYER_REBOUNDS                Total Rebounds
#> 10 NBA_GAME_PLAYER_POINTS_REBOUNDS_ASSISTS Total Points, Rebounds, and Assists
#> 11 NBA_GAME_PLAYER_TURNOVERS               Total Turnovers
#> 12 NBA_GAME_PLAYER_ASSISTS                 Total Assists
# url for props Ajax calls
(url_market <- str_replace(url_, "props$", "market?propEvent="))
#> [1] "https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent="
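As an offline illustration of the same idea, here is a minimal sketch that runs against a small, made-up stand-in for the drop-down markup (the HTML snippet and the names dropdown, items, events and base_url are assumptions for the example, not taken from the real page). It relies on html_attr() and html_text() being vectorised over a node set, so the values can be collected without map(), and then builds one request URL per propEvent.

```r
library(rvest)
library(tibble)

# Hypothetical, simplified stand-in for the drop-down markup (not the real page)
dropdown <- minimal_html('
  <ul>
    <li data-event-name="NBA_GAME_PLAYER_POINTS">Points Scored</li>
    <li data-event-name="NBA_GAME_PLAYER_ASSISTS">Total Assists</li>
  </ul>')

# Select only the list items that carry a data-event-name attribute
items <- html_elements(dropdown, "li[data-event-name]")

# html_attr() and html_text() work on the whole node set at once
events <- tibble(
  event = html_attr(items, "data-event-name"),
  descr = html_text(items)
)

# Build one Ajax URL per propEvent value
base_url <- "https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent="
events$url <- paste0(base_url, events$event)
events
```

Whether the live page still uses the data-event-name attribute has to be checked in the browser; the sketch only shows the extraction and URL-building pattern.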
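A: You generally want your CSS selectors to be more specific than just a plain div. Elements returned from html_element() / html_elements() can be passed on to the next html_element() / html_elements() call, which means you can first select all articles ( article.player-prop-article ), then iterate over that list of elements and extract the bits of interest from each article.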
# fetch content and process rows (player-prop-article), return tibble
parse_prop <- function(event_url){
  read_html(event_url) |>
    html_elements("article.player-prop-article") |>
    map(\(art) list(
      name = html_element(art, ".player-headshot-name strong") |> html_text(),
      prop = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
      proj = html_element(art, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text(),
      odds = html_element(art, ".player-bestOdds-row > a > div > span") |> html_text(),
      art  = html_element(art, ".player-analysis") |> html_text())) |>
    bind_rows()
}
# call parse_prop() on the first three propEvents and row-bind the results,
# keeping the event name in a prop_event column
props <-
  prop_events$event[1:3] |>
  set_names() |>
  map(\(event) str_c(url_market, event)) |>
  map(parse_prop, .progress = TRUE) |>
  list_rbind(names_to = "prop_event")
props
#> # A tibble: 39 × 6
#>    prop_event             name               prop  proj  odds  art
#>    <chr>                  <chr>              <chr> <chr> <chr> <chr>
#>  1 NBA_GAME_PLAYER_POINTS Ja Morant          25.5  22.4  -120  Offensive reboun…
#>  2 NBA_GAME_PLAYER_POINTS Jaren Jackson Jr.  18.5  21.5  -125  Jaren Jackson Jr…
#>  3 NBA_GAME_PLAYER_POINTS Jonas Valanciunas  15.5  14.2  -114  Out of all playe…
#>  4 NBA_GAME_PLAYER_POINTS Vince Williams Jr. 6.5   8.2   -150  Vince Williams J…
#>  5 NBA_GAME_PLAYER_POINTS CJ McCollum        17.5  19.4  -125  CJ McCollum has …
#>  6 NBA_GAME_PLAYER_POINTS Santi Aldama       8.5   9.5   -110  The Memphis Griz…
#>  7 NBA_GAME_PLAYER_POINTS Herbert Jones      9.5   10.2  -110  Herbert Jones ha…
#>  8 NBA_GAME_PLAYER_POINTS Trey Murphy III    12.5  13.5  -130  Among all player…
#>  9 NBA_GAME_PLAYER_POINTS Bismack Biyombo    6.5   6     -140  Bismack Biyombo …
#> 10 NBA_GAME_PLAYER_POINTS David Roddy        7.5   7.9   -106  David Roddy has …
#> # ℹ 29 more rows
Perhaps a more common approach is to extract column vectors from the document / parent element and combine them into a data.frame / tibble, like this:
html <- read_html("https://www.covers.com/sport/basketball/nba/matchup/290850/market?propEvent=NBA_GAME_PLAYER_POINTS")
tibble(
  name = html_elements(html, ".player-headshot-name strong") |> html_text(),
  prop = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(1) strong") |> html_text(),
  proj = html_elements(html, ".player-props-projection-bestOdds-div > div:nth-child(2) strong") |> html_text()
)
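While it also tends to be faster than iterating over elements, it is less robust, because it only works when the extracted vectors cannot possibly end up with different lengths.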
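The length problem can be reproduced offline. The following is a minimal sketch on made-up markup (the HTML, the .name / .proj classes and the variable names are assumptions for the example, not the real page structure) in which the second article lacks a projection. The column-wise extraction yields vectors of different lengths, while the per-article extraction keeps the rows aligned and fills the gap with NA.

```r
library(rvest)
library(purrr)
library(dplyr)

# Hypothetical markup: the second article has no projection element
html <- minimal_html('
  <article class="player-prop-article">
    <span class="name">Player A</span><span class="proj">22.4</span>
  </article>
  <article class="player-prop-article">
    <span class="name">Player B</span>
  </article>')

# Column-wise: html_elements() simply returns fewer nodes when one is missing,
# so the two vectors no longer line up (lengths 2 and 1)
names_vec <- html_elements(html, ".name") |> html_text()
proj_vec  <- html_elements(html, ".proj") |> html_text()
lengths(list(name = names_vec, proj = proj_vec))

# Per-article: html_element() returns a missing node when nothing matches,
# and html_text() turns that into NA, so every article still yields one row
by_article <-
  html_elements(html, "article.player-prop-article") |>
  map(\(art) list(
    name = html_element(art, ".name") |> html_text(),
    proj = html_element(art, ".proj") |> html_text())) |>
  bind_rows()
by_article
```

Passing names_vec and proj_vec to tibble() here would either fail or, worse in similar cases, silently recycle a length-one vector, which is why the per-article version is the safer default when some fields may be absent.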