Finding the range of the numbers at the end of a URL in R

Question (votes: 1, answers: 1)

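I am trying to get the range of the numbers at the end of this link: https://schedule.sxsw.com/2019/speakers/2008434

Each link ends in a number, such as 2008434 here. The links point to the bios of speakers at the upcoming South by Southwest festival. I know there are 3,729 speakers in total, but that doesn't help me figure out how each speaker and their associated page is numbered.

I am trying to do some simple web scraping with lapply, but my function doesn't work unless I can specify a range. For example, I used: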

number_range <- seq(1:3000000)

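Clicking around the links doesn't reveal any pattern in how they are numbered.

And I get a lot of Error in open.connection(x, "rb") : HTTP error 404.

Is there a simple way to get this range, or otherwise make this function work? Code below: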

library(rvest)
library(tidyverse)

# List for bios
sxsw_bios <- list()

# Creating vector of numbers
number_range <- seq(1:3000000)

# Scraping bios with names
sxsw_bios <- lapply(number_range, function(y) {

  # Getting speaker name
  Name <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y)) %>%
    html_nodes(".speaker-name") %>%
    html_text()

  Name
})
r lapply
1 Answer
2 votes

You can scrape the IDs from the speaker listing pages instead of guessing a numeric range:

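library(rvest)

# Loop over the alphabetical listing pages (a to z) and collect the speaker IDs
ids <- lapply(letters, function(x) {
  speakers <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/alpha/", x)) %>%
    rvest::html_nodes(xpath = "//*[@class='favorite-click absolute']/@data-item-id")

  # Strip the attribute name and the quotes, leaving only the ID itself
  speakers <- gsub(' data-item-id="|"', "", speakers)
  speakers
})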
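
As a variation on the extraction step, the attribute value can be read directly with html_attr() rather than cleaning up the printed attribute node with gsub(). This is only a sketch: the HTML fragment below is a made-up stand-in for the listing page (it assumes the elements carry the class and data-item-id attribute that the XPath above targets), the second ID is invented, and extract_ids() is a hypothetical helper name.

```r
library(rvest)

# Made-up fragment imitating the listing markup (an assumption, not copied from the site)
page <- read_html('
  <div>
    <a class="favorite-click absolute" data-item-id="2008434"></a>
    <a class="favorite-click absolute" data-item-id="2008435"></a>
  </div>')

# Hypothetical helper: select the elements, then read the attribute value directly
extract_ids <- function(doc) {
  doc %>%
    html_nodes(xpath = "//*[@class='favorite-click absolute']") %>%
    html_attr("data-item-id")
}

extract_ids(page)
```

On a real listing page, the same helper would be applied to the result of read_html() for each letter.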

Then use those IDs in your code. (I am only doing the first 5 in this example.)

ids <- unlist(ids)

# Scraping bios with names
sxsw_bios <- lapply(ids[1:5], function(y) {

  doc <- read_html(paste0("https://schedule.sxsw.com/2019/speakers/", y))

  # Getting speaker name
  Name <- doc %>%
    html_nodes(".speaker-name") %>%
    html_text()

  # Getting speaker bio
  bio <- doc %>%
    html_nodes(xpath = "//*[@class='row speaker-bio']") %>%
    html_text()

  list(name = Name, bio = bio)
})

sxsw_bios[[1]]

# $name
# [1] "A$AP Rocky"
#
# $bio
# [1] "A$AP Rocky is a cultural beacon that continues to ... <etc>

# ------------

sxsw_bios[[5]]

# $name
# [1] "Ken Abdo"
# 
# $bio
# [1] "Ken Abdo is a partner at the national law firm of Fox Rothschild...<etc>
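
Even with valid IDs, a single failed request (such as the HTTP 404 from the question) would abort the whole lapply() call. A minimal sketch of one way to guard against that is to wrap the fetch in tryCatch() and drop the failures afterwards. Here safe_fetch() and fake_fetch() are hypothetical names, the example.com URLs are placeholders, and fake_fetch() only simulates a failing request so the example runs without network access; in real use read_html would be passed as the fetch argument.

```r
# Hypothetical helper: return NULL instead of stopping when a fetch fails
safe_fetch <- function(url, fetch) {
  tryCatch(
    fetch(url),
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for read_html(): raises an error for one URL, as a 404 would
fake_fetch <- function(url) {
  if (grepl("missing$", url)) stop("HTTP error 404.")
  paste("page for", url)
}

# Placeholder URLs for illustration only
urls <- c("https://example.com/ok", "https://example.com/missing")

pages <- lapply(urls, safe_fetch, fetch = fake_fetch)

# Keep only the requests that succeeded
pages <- Filter(Negate(is.null), pages)
length(pages)
```

Returning NULL keeps the failure handling in one place, so the scraping function itself can stay as written above.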