我正在一组网页上运行map_dfr,以提取将捆绑到数据框中的各种元素。我做了一个简单的例子来展示这个问题,只需循环浏览两页。在两者的第一页中,有一个“防御”表,但循环的第二页中没有。 (两者都有一个进攻表 - 仅供参考)
我正在寻找如何才能防止不包含此元素(或其他元素,但我只是使用此元素作为示例)的任何页面中断循环。可能的函数出现了,所以我将其合并到下面的尝试中。问题似乎是代码确实为迭代 #2 中不存在的防御表创建了“na”,但是当需要将“防御”添加为底部的列时,它会给出底部包含的错误我的代码。我认为这与列是嵌套表这一事实有关,但我不确定。
我的目标是通过将嵌套防御表保留为一列来运行此循环,并将任何不包含该表的页面设为“NA”或空白。
首选 DPLYR 解决方案,但对任何有效的方法都开放。
library(tidyverse)
library(rvest)
library(RSelenium)
library(netstat)
rs_driver_object <- rsDriver(browser = "firefox",
verbose = F,
chromever = NULL,
port = free_port())
remDr <- rs_driver_object$client
games <- c("https://www.pro-football-reference.com/boxscores/197301140mia.htm",
"https://www.pro-football-reference.com/boxscores/196010230was.htm")
remDr <- rs_driver_object$client
test_df <-
map_dfr(games,
function(game_pull){
Sys.sleep(3)
remDr$navigate(game_pull)
x <- remDr$getPageSource() %>% unlist()
page <- read_html(x)
szn <-
page |>
html_elements(xpath = "//*[@class='hoversmooth']") |>
html_text2() |>
parse_number()
offense <- map_dfr(page,
possibly(~
page |>
html_elements(xpath = "//*[@id='all_player_offense']") |>
html_table() |>
as.data.frame() |>
janitor::row_to_names(row_number = 1) |>
janitor::clean_names() |>
nest(),
otherwise = "na"
))
defense <- map_dfr(page,
possibly(~
page |>
html_elements(xpath = "//*[@id='all_player_defense']") |>
html_table() |>
as.data.frame() |>
janitor::row_to_names(row_number = 1) |>
janitor::clean_names() |>
nest(),
otherwise = "na"
))
df <- page |>
html_elements(xpath = "//table[@class='linescore nohover stats_table no_freeze']") |>
html_table() |>
as.data.frame() |>
setNames(c("trash", "team", 'q1', "q2", "q3", "q4", "final")) |>
mutate(offense = offense,
defense = defense)
df
})
Error in `map()`:
ℹ In index: 2.
Caused by error in `mutate()`:
ℹ In argument: `defense = defense`.
Caused by error:
! `defense` must be size 2 or 1, not 0.
不幸的是,该网站上的数据似乎组织得相当不一致。看起来你的主要问题是你试图在抛出错误时填充 NA,但 rvest 在找不到数据时并不总是抛出错误,有时会返回 NULL。如果我尝试抓取网站,我可能会使用类似于下面的内容,其中定义了一个实际的(非匿名)函数,然后可以将其应用于每个 URL。
您也确实不需要 RSelenium 来抓取这个网站,因为数据是随页面一起加载的,而不是在幕后用 Javascript 做任何花哨的事情,所以我删除了它来隔离问题本身。
library(rvest)
library(dplyr)
getGameData <- function(url){
page <- read_html(url)
szn <- page |>
html_elements(xpath = "//*[@class='hoversmooth']/li[2]") |>
html_text2() |>
readr::parse_number()
game_data <- page %>%
html_elements(xpath = "//table[@class='linescore nohover stats_table no_freeze']") %>%
html_table() %>%
as.data.frame() %>%
setNames(c("trash", "team", 'q1', "q2", "q3", "q4", "final")) %>%
select(team:final)
offense_elem <- html_element(page, xpath = "//*[@id='all_player_offense']")
if(is(offense_elem, "xml_node")){
offense_data <- html_table(offense_elem, header = FALSE) %>%
filter(X3!="Passing") %>%
filter(X1!="Player") %>%
setNames(c("Player", "Tm", "Cmp", "Att", "Yds", "TD", "Int", "Sk", "Yds",
"Lng", "Rate", "Att", "Yds", "TD", "Lng", "Rec", "Yds", "TD",
"Lng")) %>%
split(.$Tm)
} else {
offense_data <- list(NA, NA)
}
game_data$offense <- offense_data
defense_elem <- html_element(page, xpath = "//*[@id='all_player_defense']")
if(is(defense_elem, "xml_node")){
defense_table <- html_table(defense_elem, header = FALSE)
if(nrow(defense_table)>0){
defense_data <- defense_table %>%
filter(X3!="Passing") %>%
filter(X1!="Player") %>%
setNames(c("Player", "Tm", "Cmp", "Att", "Yds", "TD", "Int", "Sk", "Yds",
"Lng", "Rate", "Att", "Yds", "TD", "Lng", "Rec", "Yds", "TD",
"Lng")) %>%
split(.$Tm)
} else {
defense_data <- list(NA, NA)
}
} else {
defense_data <- list(NA, NA)
}
game_data$defense <- defense_data
return(game_data)
}
games <- c("https://www.pro-football-reference.com/boxscores/197301140mia.htm",
"https://www.pro-football-reference.com/boxscores/196010230was.htm")
all_game_data <- lapply(games, getGameData) %>%
bind_rows()
通过完整的功能,我们能够为节点不存在(class=“xml_missing”而不是“xml_node”)以及节点存在但返回空表时定义更好的错误处理逻辑。
不幸的是,您共享的两个页面似乎都缺少
all_player_defense
节点,因此我不得不猜测一下预期的格式。