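I have a web page: http://probabilityfootball.com/picks.html?1520027255&username=AVERAGES&weeknum=21
From the table on this site, I am trying to extract the two teams, the winner, the pick percentage, and the score, and then turn this information into a data frame.
I know I can combine gregexpr() and regmatches() to extract the information I need. I also know that every cell of the table starts with <TD> and ends with </TD>, but what I need is the information between those tags. So far I have: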
library(RCurl)
htmlCode <- getURL("http://probabilityfootball.com/picks.html?1520027255&username=AVERAGES&weeknum=21")
data <- regmatches(htmlCode, gregexpr(pattern = "<TD>.+?</TD>", text = htmlCode))
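However, this returns a list of 29 different character strings and gets me nowhere near what I want. I don't know where to go from here.
If anyone has any input, it would be greatly appreciated. If anyone posts code, I would benefit from it being as explicit as possible. Web scraping and regular expressions are not my strong suit, and I want to understand the code rather than just copy and paste it. Thanks!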
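Since the table structure is rather messy, you may want to read the table in as text first. Since I assume you will want to scrape several weeks, you may also want to abstract out weeknum so that you can use it in a function: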
library(rvest)
library(tidyverse)
# Build the URL from its parts so that weeknum can vary
base_url <- "http://probabilityfootball.com/picks.html"
username <- "AVERAGES"
weeknum <- "21"
full_url <- paste0(base_url, "?username=", username, "&weeknum=", weeknum)
page <- read_html(full_url)
# The picks are in the fifth <table> on the page; take the text of each cell
table_text <- page %>%
  html_nodes("table") %>%
  .[5] %>%
  html_nodes("td") %>%
  html_text()
# Each row of the table has nine cells
table_matrix <- matrix(table_text, ncol = 9, byrow = TRUE)
col_names <- c("deadline", "kickoff", "home_team_name", "home_team_score",
               "home_team_pick_pct", "score", "away_team_name", "away_team_score",
               "away_team_pick_pct")
colnames(table_matrix) <- col_names
# as_tibble() replaces the deprecated as_data_frame()
result_df <- as_tibble(table_matrix)
result_df
# # A tibble: 18 x 9
#   deadline     kickoff     home_team_name home_team_score home_team_pick_… score  away_team_name away_team_score
#   <chr>        <chr>       <chr>          <chr>           <chr>            <chr>  <chr>          <chr>
# 1 Sat, 12/29N… Sat, 12/29… New England    38              86%              79.40  N.Y. Giants    35
# 2 Sun, 12/30N… Sun, 12/30… Buffalo        9               25%              60.03  Philadelphia   17
# 3 Sun, 12/30N… Sun, 12/30… Carolina       31              41%              -62.20 Tampa Bay      23
# 4 Sun, 12/30N… Sun, 12/30… Cincinnati     38              63%              24.52  Miami          25
# 5 Sun, 12/30N… Sun, 12/30… Dallas         6               40%              8.05   Washington     27
# 6 Sun, 12/30N… Sun, 12/30… Detroit        13              30%              46.40  Green Bay      34
# 7 Sun, 12/30N… Sun, 12/30… Jacksonville   28              55%              -44.94 Houston        42
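If you do want to scrape several weeks, the steps above could be wrapped in a function roughly like the sketch below. The names build_picks_url() and scrape_week() are hypothetical helpers of my own, not part of rvest, and the sketch assumes every week's page has the same layout as week 21 (picks in the fifth table, nine cells per row), which I have not checked.

```r
# Hypothetical helper: build the picks URL for a given week.
build_picks_url <- function(weeknum,
                            username = "AVERAGES",
                            base_url = "http://probabilityfootball.com/picks.html") {
  paste0(base_url, "?username=", username, "&weeknum=", weeknum)
}

# Hypothetical helper: scrape one week into a data frame of character columns.
# Assumes the picks table is the fifth <table> and has nine cells per row.
scrape_week <- function(weeknum, username = "AVERAGES") {
  page   <- rvest::read_html(build_picks_url(weeknum, username))
  tables <- rvest::html_nodes(page, "table")
  cells  <- rvest::html_text(rvest::html_nodes(tables[5], "td"))
  m <- matrix(cells, ncol = 9, byrow = TRUE)
  colnames(m) <- c("deadline", "kickoff", "home_team_name", "home_team_score",
                   "home_team_pick_pct", "score", "away_team_name",
                   "away_team_score", "away_team_pick_pct")
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Building the URL needs no network access:
week_21_url <- build_picks_url(21)
```

Calling scrape_week(21) should then give the same rows as result_df, and something like lapply(19:21, scrape_week) would collect several weeks into a list.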
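This approach still needs some cleanup (for example, any rows that do not start with a day of the week, such as "Tie Breaker" or "Regular Season...", will need to be removed).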
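One possible way to do that cleanup is sketched below in base R. The function name clean_picks() is hypothetical and the sample rows are made up for illustration. It assumes real game rows always have a deadline beginning with a three-letter weekday abbreviation followed by a comma, as in the output above.

```r
# Hypothetical helper: drop non-game rows and convert the numeric columns.
clean_picks <- function(df) {
  # Keep only rows whose deadline starts with a weekday abbreviation
  is_game <- grepl("^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),", df$deadline)
  df <- df[is_game, , drop = FALSE]
  # Scores are whole numbers; the pick percentage loses its "%" sign
  df$home_team_score    <- as.integer(df$home_team_score)
  df$home_team_pick_pct <- as.numeric(sub("%", "", df$home_team_pick_pct, fixed = TRUE)) / 100
  df$score              <- as.numeric(df$score)
  df
}

# Made-up sample rows, including one non-game row
sample_df <- data.frame(
  deadline           = c("Sat, 12/29", "Tie Breaker", "Sun, 12/30"),
  home_team_name     = c("New England", "", "Buffalo"),
  home_team_score    = c("38", "", "9"),
  home_team_pick_pct = c("86%", "", "25%"),
  score              = c("79.40", "", "60.03"),
  stringsAsFactors   = FALSE
)

cleaned <- clean_picks(sample_df)
```

The away-team columns can be converted the same way. Filtering before converting avoids the NA coercion warnings that the non-game rows would otherwise trigger.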