Rvest web scraping returns character(0)


I've done several web scrapes with rvest before, but I've never had a scrape where the request returns character(0). Does this indicate that the website is blocking me from scraping their data? Or is this some kind of JavaScript/JSON-driven page?

library(rvest)
library(robotstxt)

##checking the website Rvest and Robotstxt
paths_allowed("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")
njit <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")

##Checking file type
class(njit)

##extracting professor names
prof <- njit %>%
  xml_nodes(".cJdVEK") %>%
  html_text2()
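For context on what that result means: character(0) is rvest reporting that the CSS selector matched zero nodes in the HTML it actually received, which is exactly what happens when a page's content is filled in later by JavaScript. A minimal offline reproduction (the bare `<div id='root'>` page shell is an arbitrary stand-in for what a JS-rendered site serves to non-browser clients):

```r
library(rvest)

# The selector matches nothing in an empty page shell, so rvest
# returns a zero-length character vector -- not an error, not a block.
static <- minimal_html("<div id='root'></div>")
static %>% html_elements(".cJdVEK") %>% html_text2()
#> character(0)
```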
1 Answer

I assume you want to pull all the professors associated with New Jersey Institute of Technology? I scraped the page at this link: https://www.ratemyprofessors.com/search/teachers?query=*&sid=668 (the original link minus the trailing "htm").

Because the page uses JavaScript to render its HTML, the content rvest sees is different from the HTML a user sees in the browser. On top of that, the results load dynamically as the user scrolls down. Here's an approach that uses RSelenium to automate a web browser and keep scrolling until it has found all 5,000 or so professors at this university:

# load libraries
library(RSelenium)
library(rvest)
library(magrittr)
library(readr)

# define target url
url <- "https://www.ratemyprofessors.com/search/teachers?query=*&sid=668"


# start RSelenium ------------------------------------------------------------

rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]

# open the remote driver-------------------------------------------------------
# If it doesn't open automatically:
remDr$open()

# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)


# Close "this site uses cookies" button
remDr$findElement(using = "css",value = "button.Buttons__Button-sc-19xdot-1:nth-child(3)")$clickElement()


# Find the number of profs
# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()

# extract the number of results
number_of_profs <- page_html %>% 
                  html_node("h1") %>% 
                  html_text() %>% 
                  parse_number()


# Define a variable for the number of results we've pulled
number_of_profs_pulled <- 0


# While the number of scraped results is less than the number of total results we keep
# scrolling and pulling the html

while(number_of_profs > number_of_profs_pulled){


# scroll down the page.
# "Root" is the html id of the container that holds the search results.
# We want to scroll just to the bottom of the search results, not the
# bottom of the page, because the "click for more results" button
# doesn't appear in the html unless you're literally right at that
# part of the page
webElem <- remDr$findElement("css", ".SearchResultsPage__StyledSearchResultsPage-vhbycj-0")
#webElem$sendKeysToElement(list(key = "end"))
webElem$sendKeysToElement(list(key = "down_arrow"))


# click on the show more button ------------------------------------
remDr$findElement(using = "css",value = ".Buttons__Button-sc-19xdot-1")$clickElement()


# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>% 
  read_html()


##extract professor names
prof_names <- page_html %>%
  html_nodes(".cJdVEK") %>%
  html_text()


# update the number of profs we pulled
# so we know if we need to keep running the loop
number_of_profs_pulled <- length(prof_names)

}
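One fragile spot in the loop above: once the last batch of results has loaded, the "show more" button disappears, and findElement() then throws an error that stops the loop abruptly instead of letting it finish cleanly. A hedged sketch of a guard (the function name is my own; the selector is the same one used in the loop) that turns that error into a FALSE you can test:

```r
# Hypothetical wrapper: returns TRUE if the "show more" button was
# found and clicked, FALSE if findElement() errored because the
# button is no longer on the page (i.e. we've reached the last page).
safe_click_more <- function(remDr, css = ".Buttons__Button-sc-19xdot-1") {
  tryCatch({
    remDr$findElement(using = "css", value = css)$clickElement()
    TRUE
  }, error = function(e) FALSE)
}
```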

Results

> str(prof_names)
 chr [1:1250] "David Whitebook" "Donald Getzin" "Joseph Frank" "Soroosh Mohebbi" "Robert Lynch" "Don Wall" "Denis Blackmore" "Soha Abdeljaber" "Lamine Dieng" "Yehoshua Perl" "Douglas Burris" ...
> 

A couple of caveats:

  1. This will be slow, because you have to wait for the page to reload each time.
  2. You may want to slow it down even further to avoid the site blocking you as a bot. You can also use RSelenium to add random mouse movements and keystrokes to lower the risk of being blocked.
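Caveat 2 can be sketched as a randomized pause called at the top of each iteration of the while loop (the function name and the 2-5 second range are my own choices, not anything the site requires):

```r
# Sleep a random 2-5 seconds between scroll/click steps so the
# requests don't arrive at a mechanical, bot-like pace.
polite_pause <- function(min_s = 2, max_s = 5) {
  Sys.sleep(runif(1, min = min_s, max = max_s))
}
```

Inside the loop you'd simply call `polite_pause()` before each scroll.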