Finding every gameId by looping over a very large list of URLs and keeping the ones that exist


I am trying to get a list of every gameId for all the boxscore URLs under:

https://www.espn.com/nhl/boxscore/_/gameId/

Each URL ends with a specific gameId, for example:

https://www.espn.com/nhl/boxscore/_/gameId/4014559236

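The problem is that I don't know the range or the number of gameId values. For the start of the 2023-2024 season they appear to begin at 4014559236 and increment by 1. For the start of the 2007-2008 season, however, they begin at 271009021.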

I would like to get them from as far back as possible.

I used code found here, which lets me specify some gameId values, checks whether each URL exists, and outputs the gameId if it does.

My code, using only three gameId values from the start of the 2023-2024 season:

library(httr)
library(purrr)
library(RCurl)

# Candidate boxscore URLs for a small range of game IDs
urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/", 4014559236:4014559240)

# http_error() returns TRUE when the response status is 400 or above
is_error <- map_lgl(urls, http_error)
temp <- data.frame(logical = is_error, url = urls)
safe_urls <- temp %>%
  dplyr::filter(!logical)
dead_urls <- temp %>%
  dplyr::filter(logical)

# Keep only the URLs that RCurl also reports as existing
df_exist <- character()

for (i in seq_len(nrow(safe_urls))) {
  url <- as.character(safe_urls$url[i])
  if (url.exists(url)) {
    df_exist <- c(df_exist, url)
  }
}

urls <- df_exist

# Strip everything up to the last slash, leaving only the game ID
game_ids <- sub('.*\\/', '', urls)
print(game_ids)
[1] "401559238" "401559239" "401559240"

But if I were to specify the whole range from 271009021 to 4014559236, the number of IDs and URLs to check would be very large.
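To put a rough number on the size of that range, here is a back-of-the-envelope sketch. The two IDs are the season-start values quoted above; the 0.5 seconds per request is purely an assumed figure, not a measurement.

```r
# Number of candidate IDs between the two season-start values quoted above
first_id <- 271009021
last_id  <- 4014559236
n_ids <- last_id - first_id + 1

# Assumed average time per request in seconds (a guess, not a measurement)
secs_per_request <- 0.5
years_needed <- n_ids * secs_per_request / (60 * 60 * 24 * 365)

print(format(n_ids, big.mark = ","))
print(round(years_needed, 1))
```

That is roughly 3.7 billion candidate URLs, so checking them one at a time is not practical even if each request is fast.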

Is there another way to do this that is faster and more efficient?

I would also like to get the date of each game, although I haven't found where that is yet.

r purrr httr rcurl
1 Answer
0 votes

You can start from each team's schedule for each year, for example https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (Ducks for the 2022-23 season), and extract the game IDs from the "Result" column.

Here is the code:

library(rvest)

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)

# Get the main schedule table
schedule <- page %>% html_elements("table")

# For each row, take the third column and find its "a" subnode,
# then extract the link to the game stats from that subnode
linkstogames <- schedule %>%
  html_elements(xpath = ".//tr//td[3]//a") %>%
  html_attr("href")

# Print the extracted links
linkstogames


 [1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
 [3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
 [5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
 [7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
 [9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293"
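The same idea could be extended to several teams and seasons, and to the game dates asked about in the question. The following is only a sketch: `schedule_url()` and `parse_schedule()` are hypothetical helper names, and it assumes the date sits in the first column of the schedule table and the result link in the third. The demo table is a simplified stand-in with made-up dates, opponents, and scores, so the block runs without network access.

```r
library(rvest)

# Hypothetical helper: build a team schedule URL from ESPN's team
# abbreviation (e.g. "ana") and a season value as used in the URL.
schedule_url <- function(team, season) {
  paste0("https://www.espn.com/nhl/team/schedule/_/name/", team,
         "/season/", season)
}

# Hypothetical helper: pull date and game ID out of a parsed schedule page.
# Assumes column 1 holds the date and column 3 holds the result link.
parse_schedule <- function(page) {
  # Keep only rows whose third cell contains a link (skips header rows)
  rows  <- html_elements(page, xpath = ".//table//tr[td[3]//a]")
  links <- html_attr(html_element(rows, xpath = "./td[3]//a"), "href")
  data.frame(
    date    = html_text2(html_element(rows, xpath = "./td[1]")),
    game_id = sub(".*/gameId/([0-9]+).*", "\\1", links)
  )
}

# Offline demonstration on a simplified table.
# Dates, opponents, and scores below are made up for illustration.
demo <- minimal_html('
  <table>
    <tr><td>DATE</td><td>OPPONENT</td><td>RESULT</td></tr>
    <tr><td>Mon, Jan 1</td><td>vs AAA</td>
        <td><a href="https://www.espn.com/nhl/game/_/gameId/401349148">W 1-0</a></td></tr>
    <tr><td>Tue, Jan 2</td><td>vs BBB</td>
        <td><a href="https://www.espn.com/nhl/game/_/gameId/401349152">L 0-1</a></td></tr>
  </table>')
demo_games <- parse_schedule(demo)
print(demo_games)

# Real use (needs network access; teams and seasons are placeholders):
# grid  <- expand.grid(team = c("ana", "bos"), season = 2021:2022,
#                      stringsAsFactors = FALSE)
# pages <- Map(function(t, s) read_html(schedule_url(t, s)),
#              grid$team, grid$season)
# games <- unique(do.call(rbind, lapply(pages, parse_schedule)))
```

Because every game shows up on both teams' schedules, the combined result would need de-duplicating, which is what the `unique()` call in the commented part is for. The dates on the schedule page may not include the year, so it might have to be inferred from the season.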