Web scraping multiple pages with R

Question · votes: 0 · answers: 1

I am trying to scrape data from multiple pages. The structure is as follows: the initial URL https://www.whosampled.com/Daft-Punk/sampled/?role=1 links to other URLs, e.g. https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/, which contain the information I need. I already have code that scrapes the data from one of these second-level URLs. The challenge is iterating over all of them: a for loop seems clumsy here (and I am not sure how to implement it), and I noticed while writing the code below that there are more convenient alternatives to a for loop.

This is what I did:

library(rvest)
library(purrr)
library(stringr)
library(tibble)

page <- "https://www.whosampled.com/Daft-Punk/sampled/?role=1" %>% 
  read_html() 

pages <- page %>% 
  html_elements(".page a") %>% 
  html_text2() %>% 
  last()

str_c("https://www.whosampled.com/Daft-Punk/sampled/?role=1&sp=", 1:pages) %>% 
  map(read_html) %>% 
  html_nodes(".trackCover") %>%
  html_attr("href") %>%
  paste0("https://www.whosampled.com", ., "sampled/") %>%
  map(read_html) %>%
  map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>% 
            map_dfr(~ tibble(
              title = html_element(.x, ".trackName.playIcon") %>% 
                html_text2(),
              artist = html_element(.x, ".tdata__td3") %>% 
                html_text2(),
              year = html_element(.x, ".tdata__td3:nth-child(4)") %>% 
                html_text2(),
              genre = html_element(.x, ".tdata__badge") %>% 
                html_text2()
            )))

From there, the idea was:

str_c("https://www.whosampled.com/Daft-Punk/sampled/?role=1&sp=", 1:pages) 
in other words, to start from the initial URL and construct the other URLs from it. However, I get this error:
Error in UseMethod("xml_find_all") :  no applicable method for 'xml_find_all' applied to an object of class "list"
I believe this is because a list, rather than a single document, is being passed to html_nodes. Do you have any suggestions?
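For what it's worth, one way around that error is to keep working element-by-element with `map()` instead of piping the whole list of documents into `html_nodes()`, then collapsing the results into a single character vector. A minimal sketch, reusing the selectors and URLs from the question (untested against the live site):

```r
library(rvest)
library(purrr)
library(stringr)

start <- "https://www.whosampled.com/Daft-Punk/sampled/?role=1"

# number of result pages, read from the pagination links
pages <- read_html(start) %>%
  html_elements(".page a") %>%
  html_text2() %>%
  last() %>%
  as.numeric()

track_links <- str_c(start, "&sp=", 1:pages) %>%
  map(read_html) %>%                                    # one parsed document per page
  map(~ html_attr(html_elements(.x, ".trackCover"), "href")) %>%  # extract per document
  unlist() %>%                                          # collapse the list into one vector
  paste0("https://www.whosampled.com", ., "sampled/")
```

The key change is that `html_elements()`/`html_attr()` run inside `map()`, so each call sees a single `xml_document` rather than the whole list.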

r web-scraping rvest
1 Answer

0 votes

UPDATE: I tried two different approaches, both starting from:
setwd("C:/Users/c_ans/Desktop/bologna lezioni/comunication of statistics")
ualist <- read.table("useragents.txt", sep = "\n")

page <- read_html("https://www.whosampled.com/Daft-Punk/sampled/?role=1", user_agent = sample(ualist))
tracks_links <- page %>% 
    html_nodes(".trackCover") %>%
    html_attr("href") %>% 
    paste0("https://www.whosampled.com", ., "sampled/")

#### Approach 1

for (i in length(tracks_links)){
    page <- read_html(tracks_links[i], user_agent=sample(ualist))
    pages <- page %>% 
  html_elements(".page a") %>% 
  html_text2() %>% 
  last()
    paste0(tracks_links[i], "?cp=", 1:pages) %>%
    # I need to add a user agent here
        map(~read_html(.,user_agent)) %>% 
        map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>% 
                    map_dfr(~ tibble(
                        title = html_element(.x, ".trackName.playIcon") %>% 
                            html_text2(),
                        artist = html_element(.x, ".tdata__td3") %>% 
                            html_text2(),
                        year = html_element(.x, ".tdata__td3:nth-child(4)") %>% 
                            html_text2(),
                        genre = html_element(.x, ".tdata__badge") %>% 
                            html_text2()
                    ))) -> tracks
}
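A few things in Approach 1 would bite even before any 403: `for (i in length(tracks_links))` iterates only once (it should be `seq_along(tracks_links)`), `user_agent` inside `read_html(., user_agent)` is never defined, and `-> tracks` overwrites the result on every pass instead of accumulating. A hedged sketch of the corrected skeleton, with the row extraction factored into a helper (selectors taken from the code above; untested against the live site):

```r
library(rvest)
library(purrr)
library(dplyr)
library(tibble)

# helper: parse one paginated results page into a tibble
scrape_page <- function(url) {
  read_html(url) %>%
    html_elements(".table.tdata tbody tr") %>%
    map_dfr(~ tibble(
      title  = html_element(.x, ".trackName.playIcon")      %>% html_text2(),
      artist = html_element(.x, ".tdata__td3")              %>% html_text2(),
      year   = html_element(.x, ".tdata__td3:nth-child(4)") %>% html_text2(),
      genre  = html_element(.x, ".tdata__badge")            %>% html_text2()
    ))
}

results <- vector("list", length(tracks_links))
for (i in seq_along(tracks_links)) {          # seq_along, not length()
  page <- read_html(tracks_links[i])
  n <- page %>% html_elements(".page a") %>% html_text2() %>% last()
  urls <- paste0(tracks_links[i], "?cp=", seq_len(as.numeric(n)))
  results[[i]] <- map_dfr(urls, scrape_page)  # accumulate, don't overwrite
}
tracks <- bind_rows(results)
```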

#### Approach 2

pages_list <- list()

for (i in seq_along(tracks_links)) {
  page<- read_html(tracks_links[1], user_agent = sample(ualist))
  pages <- page %>% 
    html_elements(".page a") %>% 
    html_text2() %>% 
    last()
  pages_list[[i]] <- paste0(tracks_links[i], "?cp=", 1:as.numeric(pages))
}

tracks <- map_df(pages_list, ~read_html(.x) %>%
                    html_elements(".table.tdata tbody tr") %>% 
                    map_dfr(~ tibble(
                      title = html_element(.x, ".trackName.playIcon") %>% 
                        html_text2(),
                      artist = html_element(.x, ".tdata__td3") %>% 
                        html_text2(),
                      year = html_element(.x, ".tdata__td3:nth-child(4)") %>% 
                        html_text2(),
                      genre = html_element(.x, ".tdata__badge") %>% 
                        html_text2()
                    )))
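Approach 2 has two separate problems: inside the loop it always reads `tracks_links[1]` instead of `tracks_links[i]`, and `pages_list` ends up as a list of character *vectors*, so the `read_html(.x)` inside `map_df()` receives a whole vector of URLs rather than a single one. Flattening the list first avoids the second problem (a sketch reusing the selectors above; untested against the live site):

```r
library(rvest)
library(purrr)
library(tibble)

# after fixing the loop index to tracks_links[i], flatten before scraping
all_urls <- unlist(pages_list)   # one flat character vector of page URLs

tracks <- map_dfr(all_urls, function(url) {
  read_html(url) %>%             # now each call gets exactly one URL
    html_elements(".table.tdata tbody tr") %>%
    map_dfr(~ tibble(
      title  = html_element(.x, ".trackName.playIcon")      %>% html_text2(),
      artist = html_element(.x, ".tdata__td3")              %>% html_text2(),
      year   = html_element(.x, ".tdata__td3:nth-child(4)") %>% html_text2(),
      genre  = html_element(.x, ".tdata__badge")            %>% html_text2()
    ))
})
```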

But now I am running into a 403 error again :') I tried using a user agent, but nothing works.
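On the 403: `read_html()` has no `user_agent` argument, so the header was most likely never sent. One approach that does set it is to fetch the page with `httr` and parse the response body. A sketch (the User-Agent string is just an example, and the site may still block automated traffic regardless, so check its terms and keep the delays):

```r
library(httr)
library(rvest)

# example UA string; to rotate through useragents.txt, note that
# sample(ualist) on a data.frame shuffles its columns -- use
# sample(ualist$V1, 1) to draw a single string instead
ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

get_page <- function(url) {
  resp <- GET(url, ua)
  stop_for_status(resp)          # fail loudly on 403 and friends
  read_html(content(resp, as = "text", encoding = "UTF-8"))
}

page <- get_page("https://www.whosampled.com/Daft-Punk/sampled/?role=1")
Sys.sleep(2)                     # be polite: pause between requests
```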
