Scraping lineup data from Football Reference with R

Question (votes: 0, answers: 1)

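I always seem to run into problems when scraping the Reference sites with Python or R. Whether I use the usual XPath approach (Python) or the rvest approach in R, the table I want never seems to be picked up by the scraper.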

library(rvest)

url = 'https://www.pro-football-reference.com/years/2016/games.htm'

webpage = read_html(url)

table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)

for (x in boxscore_links) {
  keep = html_attr(x, "href")  # relative path, e.g. /boxscores/201609110rav.htm
  url2 = paste0('https://www.pro-football-reference.com', keep)
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
  #code that will bind lineup tables with some master table -- code to be written later 
}

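I'm trying to scrape the starting lineup tables. The first block of code pulls the URLs of all 2016 box scores, and the for loop then visits each box score page, hoping to extract the tables headed "Insert Team Here" Starters.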

Here is an example link: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'

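When I run the code above, the home_starters and home_starters2 objects contain zero elements (ideally they would contain the table, or the elements of the table, that I'm trying to bring in).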

I appreciate your help!

r xpath rvest
1 Answer
3
votes

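I spent the last three hours trying to figure this out. This is how it should be done: the table appears to be stored inside an HTML comment in the page source, so you select the comment nodes and re-parse their text as HTML. This is my example, but I'm sure you can apply it to yours.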

"https://www.pro-football-reference.com/years/2017/" %>%
  read_html() %>%
  html_nodes(xpath = "//comment()") %>% # select comments
  html_text() %>% # extract comment text
  paste(collapse = "") %>% # collapse to single string
  read_html() %>% # reread as HTML
  html_node("table#returns") %>% # select desired node
  html_table() 
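
To show how this might carry over to the starters tables from the question, here is a minimal offline sketch. The inline page_source string is a made-up stand-in for a box score page, the table id home_starters is an assumption based on the div ids used in the question, and read_commented_html is a hypothetical helper name. It has not been checked against the live site.

```r
library(rvest)

# Hypothetical stand-in for a box score page: the table sits inside an
# HTML comment, so it is not part of the parsed document tree.
page_source <- '
<html><body>
<div id="all_home_starters">
<!--
<div id="div_home_starters">
<table id="home_starters">
<tr><th>Player</th><th>Pos</th></tr>
<tr><td>Player A</td><td>QB</td></tr>
<tr><td>Player B</td><td>RB</td></tr>
</table>
</div>
-->
</div>
</body></html>'

webpage2 <- read_html(page_source)

# Direct lookup, as in the question: nothing matches, because the div
# only exists as text inside the comment.
direct <- html_nodes(webpage2, xpath = '//*[@id="div_home_starters"]')

# Hypothetical helper: collect the text of every comment node and
# re-parse it as HTML, following the approach in the answer above.
read_commented_html <- function(page) {
  page %>%
    html_nodes(xpath = "//comment()") %>%
    html_text() %>%
    paste(collapse = "") %>%
    read_html()
}

# Assumed table id; adjust it to whatever the real page uses.
home_starters <- webpage2 %>%
  read_commented_html() %>%
  html_node("table#home_starters") %>%
  html_table()

print(length(direct))
print(home_starters)
```

Inside the question's loop, the same helper could be applied to each webpage2 before selecting the home and visitor tables. Pausing between requests with Sys.sleep() is probably sensible when looping over a full season of box scores.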