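I am trying to scrape data from multiple pages. The structure is as follows: the initial URL https://www.whosampled.com/Daft-Punk/sampled/?role=1 lists links to nested URLs such as https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/, which contain the information I need. I already have code that scrapes the data from one of these second-level URLs. The challenge is iterating over all of them: a for loop seems cumbersome (and I am not sure how to implement it), and the code I was given suggests there are more convenient alternatives to a for loop.
This is what I have done: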
page <- "https://www.whosampled.com/Daft-Punk/sampled/?role=1" %>%
  read_html()
pages <- page %>%
  html_elements(".page a") %>%
  html_text2() %>%
  last()
str_c("https://www.whosampled.com/Daft-Punk/sampled/?role=1&sp=", 1:pages) %>%
  map(read_html) %>%
  html_nodes(".trackCover") %>%
  html_attr("href") %>%
  paste0("https://www.whosampled.com", ., "sampled/") %>%
  map(read_html) %>%
  map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>%
    map_dfr(~ tibble(
      title = html_element(.x, ".trackName.playIcon") %>%
        html_text2(),
      artist = html_element(.x, ".tdata__td3") %>%
        html_text2(),
      year = html_element(.x, ".tdata__td3:nth-child(4)") %>%
        html_text2(),
      genre = html_element(.x, ".tdata__badge") %>%
        html_text2()
    )))
The idea is to start from this:
str_c("https://www.whosampled.com/Daft-Punk/sampled/?role=1&sp=", 1:pages)
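In other words, start from the initial URL and build the other URLs from it. However, I get this error: Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "list". I believe this happens because map(read_html) returns a list of documents, and that list is passed to html_nodes instead of a single document. Do you have any suggestions?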
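To illustrate what I think is going wrong, here is a minimal sketch of applying the selector to each document in the list rather than to the list itself. It assumes rvest and purrr are installed; the two in-memory pages and their href values are made up for illustration and merely stand in for the result of map(read_html). I have not confirmed that this is the best approach for the real site.

```r
library(rvest)
library(purrr)

# Two in-memory pages stand in for the list returned by map(read_html).
# The markup and href values are hypothetical.
docs <- list(
  minimal_html('<a class="trackCover" href="/Daft-Punk/One-More-Time/"></a>'),
  minimal_html('<a class="trackCover" href="/Daft-Punk/Aerodynamic/"></a>')
)

# html_elements() expects a single document, so call it once per list element
# and then flatten the per-page character vectors into one vector.
track_urls <- docs %>%
  map(~ html_elements(.x, ".trackCover") %>% html_attr("href")) %>%
  unlist() %>%
  paste0("https://www.whosampled.com", ., "sampled/")

print(track_urls)
```

The resulting character vector could then be passed on to map(read_html) as in the pipeline above.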
UPDATE: I tried two different approaches, both starting from:
setwd("C:/Users/c_ans/Desktop/bologna lezioni/comunication of statistics")
ualist <- read.table("useragents.txt", sep = "\n")
page <- read_html("https://www.whosampled.com/Daft-Punk/sampled/?role=1", user_agent = sample(ualist))
tracks_links <- page %>%
  html_nodes(".trackCover") %>%
  html_attr("href") %>%
  paste0("https://www.whosampled.com", ., "sampled/")
#### Method 1
for (i in length(tracks_links)) {
  page <- read_html(tracks_links[i], user_agent = sample(ualist))
  pages <- page %>%
    html_elements(".page a") %>%
    html_text2() %>%
    last()
  paste0(tracks_links[i], "?cp=", 1:pages) %>%
    # I need to add a user agent here
    map(~ read_html(., user_agent)) %>%
    map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>%
      map_dfr(~ tibble(
        title = html_element(.x, ".trackName.playIcon") %>%
          html_text2(),
        artist = html_element(.x, ".tdata__td3") %>%
          html_text2(),
        year = html_element(.x, ".tdata__td3:nth-child(4)") %>%
          html_text2(),
        genre = html_element(.x, ".tdata__badge") %>%
          html_text2()
      ))) -> tracks
}
#### Method 2
pages_list <- list()
for (i in seq_along(tracks_links)) {
  page <- read_html(tracks_links[1], user_agent = sample(ualist))
  pages <- page %>%
    html_elements(".page a") %>%
    html_text2() %>%
    last()
  pages_list[[i]] <- paste0(tracks_links[i], "?cp=", 1:as.numeric(pages))
}
tracks <- map_df(pages_list, ~ read_html(.x) %>%
  html_elements(".table.tdata tbody tr") %>%
  map_dfr(~ tibble(
    title = html_element(.x, ".trackName.playIcon") %>%
      html_text2(),
    artist = html_element(.x, ".tdata__td3") %>%
      html_text2(),
    year = html_element(.x, ".tdata__td3:nth-child(4)") %>%
      html_text2(),
    genre = html_element(.x, ".tdata__badge") %>%
      html_text2()
  )))
But now I am running into error 403 again :') I tried using a user agent, but it made no difference.