I'm trying to find the range of the numbers at the end of this link: https://schedule.sxsw.com/2019/speakers/2008434.
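The link ends in a multi-digit number, for example 2008434. It points to the speaker bios for the upcoming South by Southwest festival. I know there are 3729 speakers in total, but that doesn't help me figure out how each speaker and the associated page are numbered.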
I'm trying to do some simple web scraping with the lapply function, but my function doesn't work unless I can specify a range. For example, I used:
number_range <- seq(1:3000000)
Clicking around the links doesn't reveal any pattern in how they are numbered, and I got a lot of Error in open.connection(x, "rb") : HTTP error 404. messages.
Is there a simple way to get this range, or otherwise make this function work? Code below:
library(rvest)
library(tidyverse)
# List for bios
sxsw_bios <- list()
# Creating vector of numbers
number_range <- seq(1:3000000)
# Scraping bios with names
sxsw_bios <- lapply(number_range, function(y) {
  # Getting speaker name
  Name <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y)) %>%
    html_nodes(".speaker-name") %>%
    html_text()
})
You can scrape the IDs from the speaker listing pages:
library(rvest)
# One listing page per letter; collect the speaker IDs found on each
ids <- lapply(letters, function(x) {
  speakers <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    rvest::html_nodes(xpath = "//*[@class='favorite-click absolute']/@data-item-id")
  # Strip the attribute name and quotes, leaving only the ID itself
  speakers <- gsub(' data-item-id="|"', "", speakers)
  speakers
})
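To see what that XPath picks up without hitting the site, here is a minimal sketch on a made-up HTML fragment. The markup and the second ID are hypothetical, invented only for illustration, and the real page may be structured differently. It also uses html_attr() as an alternative to the gsub() cleanup:

```r
library(rvest)

# Hypothetical markup, invented for illustration; the real page may differ
html <- '<div>
  <a class="favorite-click absolute" data-item-id="2008434"></a>
  <a class="favorite-click absolute" data-item-id="2008435"></a>
</div>'

# Select the elements by class, then read the attribute value directly
example_ids <- read_html(html) %>%
  html_nodes(xpath = "//*[@class='favorite-click absolute']") %>%
  html_attr("data-item-id")

example_ids
```

html_attr() returns a plain character vector, so no string cleanup is needed afterwards.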
Then use these IDs in your code. (I'm only doing the first 5 in this example.)
ids <- unlist(ids)
# Scraping bios with names
sxsw_bios <- lapply(ids[1:5], function(y) {
  doc <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y))
  # Getting speaker name
  Name <- doc %>%
    html_nodes(".speaker-name") %>%
    html_text()
  bio <- doc %>%
    html_nodes(xpath = "//*[@class='row speaker-bio']") %>%
    html_text()
  list(name = Name, bio = bio)
})
sxsw_bios[[1]]
# $name
# [1] "A$AP Rocky"
#
# $bio
# [1] "A$AP Rocky is a cultural beacon that continues to ... <etc>
# ------------
sxsw_bios[[5]]
# $name
# [1] "Ken Abdo"
#
# $bio
# [1] "Ken Abdo is a partner at the national law firm of Fox Rothschild...<etc>
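If some requests still fail (for instance with the HTTP 404 errors mentioned in the question), one option is to wrap read_html() in tryCatch() so that a single bad page doesn't stop the whole lapply(). This is only a sketch; read_html_or_null is a hypothetical helper name, not part of rvest:

```r
library(rvest)

# Hypothetical helper: return the parsed page, or NULL if reading it fails
read_html_or_null <- function(url) {
  tryCatch(read_html(url), error = function(e) NULL)
}

# Usage sketch (needs `ids` from above and network access):
# pages <- lapply(ids[1:5], function(y) {
#   read_html_or_null(paste0("https://schedule.sxsw.com/2019/speakers/", y))
# })
# pages <- Filter(Negate(is.null), pages)  # drop the failed requests
```

Returning NULL keeps the results aligned with the input IDs until you choose to filter them out.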