I've done web scraping with rvest a few times, but I've never run into a case where the request returns character(0). Does this mean the website is blocking me from scraping their data, or is the content loaded by some kind of JavaScript/JSON query?
library(rvest)
library(robotstxt)
##checking the website Rvest and Robotstxt
paths_allowed("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")
njit <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668htm")
##Checking file type
class(njit)
##extracting professor names
prof <- njit %>%
  html_elements(".cJdVEK") %>%
  html_text2()
I assume you want to pull all of the professors associated with New Jersey Institute of Technology? I scraped the page at this link: https://www.ratemyprofessors.com/search/teachers?query=*&sid=668 (the original link minus the trailing "htm").

Because the page uses JavaScript to render its results, the HTML that rvest sees is different from the HTML a user sees in the browser. In addition, results load dynamically as the user scrolls down. Here's an approach that uses RSelenium to automate a web browser, scrolling until it finds all 5,000 or so professors at this university:
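Before automating a browser, you can confirm the content really is JavaScript-rendered: the static HTML that rvest downloads contains no nodes matching the selector, which is exactly why html_text2() returns character(0) instead of an error. A quick sketch of that check (the .cJdVEK selector is taken from the question):

```r
library(rvest)

static_page <- read_html("https://www.ratemyprofessors.com/search/teachers?query=*&sid=668")

# If the professor cards are injected by JavaScript after page load,
# the static HTML has no matching elements, so this returns 0 and
# html_text2() on the (empty) node set returns character(0).
length(html_elements(static_page, ".cJdVEK"))
```

A result of 0 here, combined with visible results in a real browser, is the usual signature of dynamically loaded content rather than an outright block.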
# load libraries
library(RSelenium)
library(rvest)
library(magrittr)
library(readr)
# define target url
url <- "https://www.ratemyprofessors.com/search/teachers?query=*&sid=668"
# start RSelenium ------------------------------------------------------------
rD <- rsDriver(browser="firefox", port=4550L, chromever = NULL)
remDr <- rD[["client"]]
# open the remote driver-------------------------------------------------------
# If it doesn't open automatically:
remDr$open()
# Navigate to webpage -----------------------------------------------------
remDr$navigate(url)
# Close "this site uses cookies" button
remDr$findElement(using = "css",value = "button.Buttons__Button-sc-19xdot-1:nth-child(3)")$clickElement()
# Find the number of profs
# pull the webpage html
# then read it
page_html <- remDr$getPageSource()[[1]] %>%
  read_html()
# extract the number of results
number_of_profs <- page_html %>%
  html_node("h1") %>%
  html_text() %>%
  parse_number()
# Define a variable for the number of results we've pulled
number_of_profs_pulled <- 0
# While the number of scraped results is less than the number of total results we keep
# scrolling and pulling the html
while(number_of_profs > number_of_profs_pulled){
  # scroll down the page
  # Root is the html id of the container that holds the search results;
  # we want to scroll just to the bottom of the search results, not the
  # bottom of the page, because it looks like the
  # "click for more results" button doesn't appear in the html
  # unless you're literally right at that part of the page
  webElem <- remDr$findElement("css", ".SearchResultsPage__StyledSearchResultsPage-vhbycj-0")
  #webElem$sendKeysToElement(list(key = "end"))
  webElem$sendKeysToElement(list(key = "down_arrow"))
  # click on the show more button ------------------------------------
  remDr$findElement(using = "css", value = ".Buttons__Button-sc-19xdot-1")$clickElement()
  # pull the webpage html, then read it
  page_html <- remDr$getPageSource()[[1]] %>%
    read_html()
  ## extract professor names
  prof_names <- page_html %>%
    html_nodes(".cJdVEK") %>%
    html_text()
  # update the number of profs we pulled
  # so we know if we need to keep running the loop
  number_of_profs_pulled <- length(prof_names)
}
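One fragile spot in the loop above: once all results are loaded, the "show more" button disappears from the page, and findElement() will then throw an error. A sketch of a more defensive click step, wrapping it in tryCatch and pausing to let new results render (the 2-second pause is an assumption, tune it to your connection):

```r
# Defensive version of the "show more" click (selector reused from above).
# Returns TRUE if the button was found and clicked, FALSE otherwise.
try_click_show_more <- function(remDr) {
  tryCatch({
    btn <- remDr$findElement(using = "css", value = ".Buttons__Button-sc-19xdot-1")
    btn$clickElement()
    TRUE
  },
  error = function(e) FALSE)  # button absent or not yet clickable
}

# Inside the while loop, replace the bare clickElement() call with:
# if (try_click_show_more(remDr)) Sys.sleep(2)  # give new results time to load
```

This keeps a transient "NoSuchElement" error from killing the whole scrape mid-run.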
> str(prof_names)
chr [1:1250] "David Whitebook" "Donald Getzin" "Joseph Frank" "Soroosh Mohebbi" "Robert Lynch" "Don Wall" "Denis Blackmore" "Soha Abdeljaber" "Lamine Dieng" "Yehoshua Perl" "Douglas Burris" ...
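Once the loop finishes, it's worth saving the results and shutting the Selenium session down cleanly so the browser and server processes don't linger. A minimal sketch, using write_csv() from readr (already loaded above) and the standard RSelenium teardown calls; the output filename is an assumption:

```r
# save the scraped names, then close the browser and stop the server
readr::write_csv(data.frame(professor = prof_names), "njit_professors.csv")
remDr$close()
rD$server$stop()
```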
Notes: