Web scraping an HTML table in R

Question  Votes: 2  Answers: 1

I have a web page: http://probabilityfootball.com/picks.html?1520027255&username=AVERAGES&weeknum=21

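From the table on this site, I am trying to extract the two teams, the winner, the pick %, and the score, and then turn this information into a data frame.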

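I know I can use a combination of gregexpr() and regmatches() to extract the information I need. I also know that each cell of the table starts with <TD> and ends with </TD>, but what I need is the information between these tags. So far I have: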

library(RCurl)
htmlCode <- getURL("http://probabilityfootball.com/picks.html?1520027255&username=AVERAGES&weeknum=21")
data <- regmatches(htmlCode, gregexpr(pattern = "<TD>.+?</TD>", text = htmlCode))

However, this returns a list containing 29 different character strings, which is nowhere near what I want. I don't know where to go from here.

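If anyone has any input, it would be greatly appreciated. If anyone posts code, it would help me if it were as explicit as possible. Web scraping and regular expressions are not my strong suit, and I want to understand the code rather than just copy and paste it. Thanks!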

r regex web-scraping
1 Answer
0
votes

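Since the table structure is fairly messy, you may want to read the table in as text first. Since I assume you will want to scrape several weeks, you may also want to abstract out weeknum so that you can use it in a function: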

library(rvest)
library(tidyverse)

# Build the URL from its parts so that weeknum can be swapped out later
base_url <- "http://probabilityfootball.com/picks.html"
username <- "AVERAGES"
weeknum <- "21"
full_url <- paste0(base_url, "?username=", username, "&weeknum=", weeknum)

page <- read_html(full_url)

# Take the fifth <table> on the page and pull the text of every cell
table_text <- page %>%
  html_nodes("table") %>%
  .[5] %>%
  html_nodes("td") %>%
  html_text()

# The cells arrive as one flat vector; fold them into rows of nine columns
table_matrix <- matrix(table_text, ncol = 9, byrow = TRUE)

col_names <- c("deadline", "kickoff", "home_team_name", "home_team_score",
               "home_team_pick_pct", "score", "away_team_name", "away_team_score", "away_team_pick_pct")
colnames(table_matrix) <- col_names

# as_tibble() replaces the deprecated as_data_frame()
result_df <- as_tibble(table_matrix)
result_df
# # A tibble: 18 x 9
#   deadline     kickoff     home_team_name home_team_score  home_team_pick_… score  away_team_name away_team_score
#   <chr>        <chr>       <chr>          <chr>            <chr>            <chr>  <chr>          <chr>          
# 1 Sat, 12/29N… Sat, 12/29… New England    38               86%              79.40  N.Y. Giants    35             
# 2 Sun, 12/30N… Sun, 12/30… Buffalo        9                25%              60.03  Philadelphia   17             
# 3 Sun, 12/30N… Sun, 12/30… Carolina       31               41%              -62.20 Tampa Bay      23             
# 4 Sun, 12/30N… Sun, 12/30… Cincinnati     38               63%              24.52  Miami          25             
# 5 Sun, 12/30N… Sun, 12/30… Dallas         6                40%              8.05   Washington     27             
# 6 Sun, 12/30N… Sun, 12/30… Detroit        13               30%              46.40  Green Bay      34             
# 7 Sun, 12/30N… Sun, 12/30… Jacksonville   28               55%              -44.94 Houston        42     
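
The idea of abstracting out weeknum could be sketched roughly as follows. This is only a sketch under assumptions: `build_picks_url`, `cells_to_df`, and `scrape_week` are hypothetical helper names that do not appear in the answer, and the table index of 5 and the nine-column layout are carried over from the code above on the assumption that every week's page is laid out the same way.

```r
# Hypothetical helper: build the picks URL for a given week.
build_picks_url <- function(weeknum, username = "AVERAGES",
                            base_url = "http://probabilityfootball.com/picks.html") {
  paste0(base_url, "?username=", username, "&weeknum=", weeknum)
}

# Hypothetical helper: fold a flat vector of cell texts into a data frame
# with one column per name in col_names.
cells_to_df <- function(cell_text, col_names) {
  # Fail early if the cells do not divide evenly into rows.
  stopifnot(length(cell_text) %% length(col_names) == 0)
  m <- matrix(cell_text, ncol = length(col_names), byrow = TRUE,
              dimnames = list(NULL, col_names))
  as.data.frame(m, stringsAsFactors = FALSE)
}

# Hypothetical wrapper: scrape one week. Needs the rvest package and
# network access; table_index = 5 is an assumption taken from the answer.
scrape_week <- function(weeknum, col_names, table_index = 5) {
  page <- rvest::read_html(build_picks_url(weeknum))
  tables <- rvest::html_nodes(page, "table")
  cell_text <- rvest::html_text(rvest::html_nodes(tables[table_index], "td"))
  cells_to_df(cell_text, col_names)
}
```

With the `col_names` vector from the code above, several weeks could then be collected with something like `lapply(17:21, scrape_week, col_names = col_names)` (the week range is only an example) and stacked with `do.call(rbind, ...)`.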

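This approach still needs some cleanup (for example, any rows that do not start with a day of the week, such as "Tie Breaker" or "Regular Season...", will need to be removed).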
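
One possible way to do that cleanup is to keep only the rows whose first column begins with a weekday abbreviation. The sketch below assumes the abbreviations look like the "Sat," and "Sun," values in the output above; `keep_game_rows` is a hypothetical helper name, and the small `toy` data frame is made-up stand-in data, not the real scraped table.

```r
# Hypothetical helper: keep rows whose date column starts with a weekday
# abbreviation followed by a comma (e.g. "Sat, 12/29").
keep_game_rows <- function(df, date_col = "deadline") {
  weekday_pattern <- "^(Mon|Tue|Wed|Thu|Fri|Sat|Sun),"
  # trimws() guards against stray whitespace left over from the HTML.
  is_game <- grepl(weekday_pattern, trimws(df[[date_col]]))
  df[is_game, , drop = FALSE]
}

# Toy stand-in for the scraped table (illustrative values only).
toy <- data.frame(
  deadline = c("Sat, 12/29", "Tie Breaker", "Sun, 12/30"),
  home_team_name = c("New England", "", "Buffalo"),
  stringsAsFactors = FALSE
)

games <- keep_game_rows(toy)
print(games)
```

On the real data this would be called as `keep_game_rows(result_df)`. Whether the weekday test catches every non-game row on the actual page is an assumption worth checking against the output.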
