Scraping eliteprospects.com in R

Question  Votes: 0  Answers: 1

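I'm trying to scrape the first table (100 rows) on this page: https://www.eliteprospects.com/league/ushl/stats/2018-2019?sort=ppg

I couldn't find a CSS selector that scrapes the whole table at once, so I scrape each column separately and then try to combine all the columns into a data frame or tibble.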

library(tidyverse)
library(rvest)


# Read the page---------------------------------------------------------
url <- read_html("https://www.eliteprospects.com/league/ushl/stats/2018-2019?sort=ppg")

# Player column
player <- url %>% 
  html_nodes("#skater-stats .player") %>% 
  html_text() %>% 
  str_trim()

player <- player[-1]

# Clean player column
player_df <- data.frame(player) %>% 
  mutate(player = as.character(player)) %>% 
  # Filter out empty values (strings with zero characters)
  filter(nchar(player) > 0)


# Team----------------------------------------------------------------
team <- url %>% 
  html_nodes("#skater-stats .team") %>% 
  html_text() %>% 
  str_trim()

team_df <- data.frame(team) %>% 
  slice(-1) %>% 
  mutate(team = as.character(team))

team_df <- team_df %>% 
  filter(nchar(team) > 0)
# The number of team rows exceeds 100 because some players played for several teams during the season


# Games Played-----------------------------------------------------------
gp <- url %>% 
  html_nodes("#skater-stats .gp") %>% 
  html_text() %>% 
  str_trim()

gp_df <- data.frame(games_played = gp) %>% 
  slice(-1) 

gp_df <- gp_df %>% 
  mutate(games_played = as.character(games_played)) %>% 
  filter(nchar(games_played) > 0)
# The number of games-played rows exceeds 100 because some players played for several teams during the season


# Goals-----------------------------------------------------------
goals <- url %>% 
  html_nodes("#skater-stats .g") %>% 
  html_text() %>% 
  str_trim()

goals_df <- data.frame(goals) %>% 
  slice(-1)

goals_df <- goals_df %>% 
  mutate(goals = as.character(goals)) %>% 
  filter(nchar(goals) > 0)


# Assists-----------------------------------------------------------
assists <- url %>% 
  html_nodes("#skater-stats .a") %>% 
  html_text() %>% 
  str_trim()

assists_df <- data.frame(assists) %>% 
  slice(-1)

assists_df <- assists_df %>% 
  mutate(assists = as.character(assists)) %>% 
  filter(nchar(assists) > 0)


# Total Points-----------------------------------------------------------
total_points <- url %>% 
  html_nodes("#skater-stats .tp") %>% 
  html_text() %>% 
  str_trim()

total_points_df <- data.frame(total_points) %>% 
  slice(-1)

total_points_df <- total_points_df %>% 
  mutate(total_points = as.character(total_points)) %>% 
  filter(nchar(total_points) > 0)

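The problem I'm facing is that player_df has 100 rows of player data, but because some players played for more than one team, the stat columns have 120 rows. For example, Brendan Furry (LW) played for two teams.

How can I drop the per-team stats and keep only the totals for players who played on multiple teams, in a reproducible way? I'd like to run the same thing for multiple years, so I want to write a function!

Thanks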

r rvest
1 Answer
0 votes

You can actually scrape the whole table at once and then just filter out the rows that have no player name:

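library(rvest)
library(dplyr)

url <- read_html("https://www.eliteprospects.com/league/ushl/stats/2018-2019?sort=ppg")

# Pull the whole stats table in one call (this absolute XPath depends on the page layout)
tab <- html_table(html_node(url,xpath="/html/body/section[2]/div/div[1]/div[4]/div[3]/div[1]/div/div[4]/table"))

# Drop the rows that have no player name
tab <- tab %>% filter(Player != "")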

head(tab,17)

    #                  Player                    Team GP  G  A TP  PPG PIM +/-
1   1       Alex Turcotte (C)          USNTDP Juniors 16 12 22 34 2.13  14  27
2   2         Jack Hughes (C)          USNTDP Juniors 24 12 36 48 2.00   4  17
3   3        Bobby Brink (RW)   Sioux City Musketeers 43 35 33 68 1.58  22  23
4   4      Matthew Boldy (LW)          USNTDP Juniors 28 17 26 43 1.54  16  14
5   5     Trevor Zegras (C/W)          USNTDP Juniors 27 14 26 40 1.48  34  18
6   6    Cole Caufield (C/RW)          USNTDP Juniors 28 29 12 41 1.46  23  19
7   7     Martin Pospisil (C)   Sioux City Musketeers 44 16 47 63 1.43 118  13
8   8       Ronnie Attard (D)          Tri-City Storm 48 30 34 64 1.33  66  46
9   9      Nick Abruzzese (F)           Chicago Steel 62 29 51 80 1.29  20  -1
10 10       Brett Murray (LW)     Youngstown Phantoms 62 41 35 76 1.23  35  13
11 11            Cam York (D)          USNTDP Juniors 28  7 26 33 1.18  12  40
12 12    Matias Maccelli (LW) Dubuque Fighting Saints 62 31 41 72 1.16  42   8
13 13  Mikael Hakkarainen (C)    Muskegon Lumberjacks 42 19 28 47 1.12  22  24
14 14     Michael Gildon (LW)          USNTDP Juniors 26 13 16 29 1.12  28  24
15 15 Robert Mastrosimone (C)           Chicago Steel 54 31 29 60 1.11  28   5
16 16       Ben Meyers (C/LW)             Fargo Force 59 33 32 65 1.10  26  11
17 17      Brendan Furry (LW)                  totals 52 21 36 57 1.10  20  12
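
Since the question asks for something reusable across seasons, here is a minimal sketch of how the filtering step could be wrapped in functions. It assumes that the rows to drop always have an empty (or NA) Player cell, as in the output above where Brendan Furry's remaining row has Team "totals", and that other seasons use the same URL pattern as the 2018-2019 page. The names `season_url` and `keep_player_rows` are hypothetical, and the two per-team rows in the toy table are made up for illustration. It uses base R only so it runs without a network connection.

```r
# Hypothetical helper: build the stats URL for a league and season.
# Assumes other seasons follow the same pattern as the 2018-2019 URL.
season_url <- function(season, league = "ushl") {
  sprintf("https://www.eliteprospects.com/league/%s/stats/%s?sort=ppg",
          league, season)
}

# Hypothetical helper: keep only the rows that carry a player name.
# Assumes rows without a name have an empty or NA Player cell.
keep_player_rows <- function(tab) {
  has_name <- !is.na(tab$Player) & nzchar(trimws(tab$Player))
  tab[has_name, , drop = FALSE]
}

# Toy table shaped like the scraped one; the two unnamed rows are made up
toy <- data.frame(
  Player = c("Ben Meyers (C/LW)", "Brendan Furry (LW)", "", ""),
  Team   = c("Fargo Force", "totals", "Team A", "Team B"),
  GP     = c(59, 52, 30, 22),
  stringsAsFactors = FALSE
)

clean <- keep_player_rows(toy)
print(clean)
print(season_url("2017-2018"))
```

In a real run, the table returned by html_table() for read_html(season_url(season)) would be passed to keep_player_rows(), and the function could be applied over a vector of seasons. Whether the XPath from the answer still matches other seasons' pages would need to be checked.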
