I am trying to get a list of all the gameId values for every boxscore URL under:
https://www.espn.com/nhl/boxscore/_/gameId/
Each URL ends with a specific gameId, for example:
https://www.espn.com/nhl/boxscore/_/gameId/4014559236
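The problem is that I don't know the range or the number of gameIds. For the start of the 2023-2024 season they appear to begin at 4014559236 and increase by 1. At the start of the 2007-2008 season, however, they begin at 271009021.
I want to go back as far as possible.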
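I used the code found here, which lets me specify some gameIds, checks whether each URL exists, and outputs the gameId if it does.
My code uses only three gameIds from the start of the 2023-2024 season: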
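library(httr)
library(purrr)
library(RCurl)

# Candidate boxscore URLs for a small range of gameIds
urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", 4014559236:4014559240)

# http_error() is TRUE when a URL responds with an HTTP error status
safe_url_logical <- map(urls, http_error)
temp <- cbind(unlist(safe_url_logical), unlist(urls))
colnames(temp) <- c("logical", "url")
temp <- as.data.frame(temp)

# Split into URLs that responded normally and URLs that returned an error
safe_urls <- temp %>%
  dplyr::filter(logical == "FALSE")
dead_urls <- temp %>%
  dplyr::filter(logical == "TRUE")

# Keep only the URLs that url.exists() also confirms
df_exist <- character(0)
for (i in seq_len(nrow(safe_urls))) {
  url <- as.character(safe_urls$url[i])
  exist <- url.exists(url)
  if (exist) {
    df_exist <- c(df_exist, url)
  }
}

# The gameId is everything after the last slash
urls <- df_exist
game_ids <- sub(".*/", "", urls)
print(game_ids)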
[1] "401559238" "401559239" "401559240"
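But if I were to check everything from 271009021 to 4014559236, the number of IDs and URLs to test would be enormous.
Is there another way to do this that is faster and more efficient?
I would also like to get the date of each game, although I haven't found where that is yet.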
You can start from each team's schedule for each year. For example: https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (Ducks for 2022-23 season), and extract the game IDs from the "Result" column.
Here is the code:
library(rvest)

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# Get the main schedule table
schedule <- page %>% html_elements("table")

# For each row, take the third column and find the "a" subnode,
# then extract the link to the game stats from that subnode
linkstogames <- schedule %>%
  html_elements(xpath = ".//tr //td[3] //a") %>%
  html_attr("href")
linkstogames
[1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
[3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
[5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
[7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
[9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293
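
To cover more than one team and season, the same approach could be wrapped in a loop. The following is only a rough sketch, not a tested scraper: parse_schedule() and collect_games() are hypothetical helper names, the one-second pause is an arbitrary choice, and it assumes the first column of the schedule table holds the game date (which would also give you the dates you asked about). The date text may not include the year, so the season is stored alongside it.

```r
library(rvest)

# Parse one schedule page into a data frame of date text and gameId.
# Assumption: column 1 holds the date and column 3 holds the result link.
parse_schedule <- function(page) {
  rows <- page %>% html_elements("table") %>% html_elements(xpath = ".//tr")
  link <- rows %>% html_element(xpath = "./td[3]//a") %>% html_attr("href")
  date <- rows %>% html_element(xpath = "./td[1]") %>% html_text2()
  # Rows without a game link (headers, unplayed games) are dropped
  keep <- !is.na(link) & grepl("/gameId/", link, fixed = TRUE)
  data.frame(
    date = date[keep],
    game_id = sub(".*/gameId/([0-9]+).*", "\\1", link[keep]),
    stringsAsFactors = FALSE
  )
}

# Fetch and parse every team/season combination, pausing between requests.
collect_games <- function(teams, seasons, pause = 1) {
  out <- list()
  for (season in seasons) {
    for (team in teams) {
      url <- paste0("https://www.espn.com/nhl/team/schedule/_/name/",
                    team, "/season/", season)
      # Skip pages that fail to load, e.g. a team with no schedule that season
      games <- tryCatch(parse_schedule(read_html(url)), error = function(e) NULL)
      if (!is.null(games) && nrow(games) > 0) {
        games$season <- season
        out[[length(out) + 1]] <- games
      }
      Sys.sleep(pause)
    }
  }
  if (length(out) == 0) {
    return(data.frame(date = character(0), game_id = character(0),
                      season = integer(0), stringsAsFactors = FALSE))
  }
  all_games <- do.call(rbind, out)
  # Each game appears on both teams' schedules, so keep one row per gameId
  all_games[!duplicated(all_games$game_id), ]
}

# Example call (not run): "ana" is the team code from the URL above;
# add the other team codes and seasons you need.
# games <- collect_games(c("ana"), 2022)
```

This makes one request per team per season instead of one request per candidate gameId, so the number of pages to fetch stays in the hundreds rather than the billions.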