Multi-level web scraping of a website in R


I am a beginner at web scraping with R. As a first attempt, I did a simple scrape with R; this is what I have done so far.

  1. Scrape the staff details from this site (https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff). This is the code I used:
library(rvest)
url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")
url %>% html_nodes(".sppb-addon-content") %>% html_text()

The code above works and all the scraped data is displayed. When you click on each staff member, you get further details such as research interests, areas of specialization, profile, and so on.

  2. When you click on each staff member, you get further details such as research interests, areas of specialization, profile, etc. How can I scrape that data as well and show it in the dataset above, matched to each staff member?
r web-scraping rvest

1 Answer, 4 votes

The code below will get you all the links to each professor's page. From there, you can use purrr's map_df or map functions to map each link to another set of rvest calls.

Most importantly, credit goes to @hrbrmstr: R web scraping across multiple pages.

The linked answer is subtly different in that it maps over a set of page numbers, rather than over a vector of URLs as the code below does.
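
To make that distinction concrete, here is a minimal sketch (not from the original answer) of the two mapping patterns. scrape_one_page() is a hypothetical placeholder for whatever rvest calls you would run against a single page.

library(purrr)

scrape_one_page <- function(page_url) {
  #hypothetical helper: stand-in for the rvest work done on one page
  data.frame(url = page_url)
}

pages_by_number <- map_df(1:3, function(i) {
  #linked answer's pattern: map over page numbers and build each URL
  scrape_one_page(sprintf("https://example.com/results?page=%d", i))
})

pages_by_url <- map_df(c("https://example.com/staff/a",
                         "https://example.com/staff/b"),
                       scrape_one_page)
#this answer's pattern: map over a vector of URLs scraped from the index page

The full answer code follows below.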

library(rvest)
library(purrr)
library(stringr)
library(dplyr)

url <- read_html("https://science.kln.ac.lk/depts/im/index.php/staff/academic-staff")

names <- url %>%
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>%
  html_text()
#extract the names

names <- names[-c(3,4)]
#drop the head of department and blank space

names <- names %>%
  tolower() %>%
  str_extract_all("[:alnum:]+") %>%
  sapply(paste, collapse = "-")
#create a list of names separated by dashes, should be identical to link names

content <- url %>% 
  html_nodes(".sppb-addon-content") %>%
  html_text()

content <- content[! content %in% "+"]
#drop the "+" from the content

content_names <- data.frame(prof_name = names, content = content)
#make a df with the content and the names, note the prof_name column is the same as below
#this allows for joining later on

links <- url %>% 
  html_nodes(".sppb-addon-content") %>%
  html_nodes("strong") %>% 
  html_nodes("a") %>%
  html_attr("href")
#create a vector of href links

url_base <- "https://science.kln.ac.lk%s"
urls <- sprintf(url_base, links)
#create a vector of urls for the professor's pages


prof_info <- map_df(urls, function(x) {
  #create an anonymous function to pull the data

  prof_name <- gsub("https://science.kln.ac.lk/depts/im/index.php/", "", x)
  #extract the prof's name from the url

  page <- read_html(x)
  #read each page in the urls vector

  sections <- page %>%
    html_nodes(".sppb-panel-title") %>%
    html_text()
  #extract the section title

  info <- page %>%
    html_nodes(".sppb-panel-body") %>%
    html_nodes(".sppb-addon-content") %>%
    html_text()
  #extract the info from each section

  data.frame(sections = sections, info = info, prof_name = prof_name)
  #return a long data frame with one row per section: its title, its text,
  #and the professor's name

}) 
#note this returns a dataframe. Change map_df to map if you want a list
#of tibbles instead

prof_info <- inner_join(content_names, prof_info, by = "prof_name")
#joining the content from the first page to all the individual pages
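
If you want one row per staff member, with the section titles as columns so it matches the shape of the first dataset, one option is a reshape with tidyr. This is a sketch on top of the prof_info data frame above, not part of the original answer, and it assumes each professor has at most one panel per section title (duplicate titles would produce list-columns).

library(tidyr)

prof_info_wide <- prof_info %>%
  pivot_wider(names_from = sections, values_from = info)
#reshape to one row per professor; each section title becomes its own column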

Not sure if this is the cleanest or most efficient way to do it, but I think this is what you're after.
