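Good evening. I'm working on a data visualization project and need some data from a website. It involves a large number of observations, so I thought I'd do some scraping to speed things up. This is my first time scraping, but I found a helpful YouTube video that gave me the code. Everything went fine until I got the following error: "Error in read_html.response(link) : Forbidden (HTTP 403)." From this I gather that the site may not allow scraping.
So I tried setting a user agent with the following code: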
library(httr)   # GET(), user_agent()
library(rvest)  # read_html()
user.agent <- "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
link <- GET("https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/", user_agent(user.agent))
page <- read_html(link)
But I still get the same error. Does anyone have any suggestions?
Here's a start:
library(tidyverse)
library(rvest)

# Read one results page so we can look at its pagination links
page <- "https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/?cp=2" %>%
  read_html()

# Take the text of the last pagination link as the page count
pages <- page %>%
  html_elements(".page a") %>%
  html_text2() %>%
  last() %>%
  as.integer()

# Read every page, then build one tibble row per table row
str_c("https://www.whosampled.com/Daft-Punk/Harder,-Better,-Faster,-Stronger/sampled/?cp=", 1:pages) %>%
  map(read_html) %>%
  map_dfr(~ html_elements(.x, ".table.tdata tbody tr") %>%
    map_dfr(~ tibble(
      title = html_element(.x, ".trackName.playIcon") %>%
        html_text2(),
      artist = html_element(.x, ".tdata__td3") %>%
        html_text2(),
      year = html_element(.x, ".tdata__td3:nth-child(4)") %>%
        html_text2(),
      genre = html_element(.x, ".tdata__badge") %>%
        html_text2()
    )))
# A tibble: 77 × 4
   title                                                  artist             year  genre
   <chr>                                                  <chr>              <chr> <chr>
 1 Stronger                                               Kanye West         2007  Vocals / Lyrics
 2 Boom Boom Pow                                          Black Eyed Peas    2009  Vocals / Lyrics
 3 Overdose                                               EXO                2014  Vocals / Lyrics
 4 Harder, Better, Faster, Stronger                       Bashy              2007  Multiple Elements
 5 Daft Punk Is Playing at My House (Soulwax Shibuya Mix) LCD Soundsystem    2004  Sound FX / Other
 6 Face to Face / Short Circuit                           Daft Punk          2007  Multiple Elements
 7 Harder Better Faster Stronger (Deadmau5 Edit)          deadmau5           2007  Multiple Elements
 8 Work Is Never Over                                     Diplo              2007  Vocals / Lyrics
 9 Let Me See You                                         Girl Talk          2008  Vocals / Lyrics
10 Make It Faster                                         Cruz and the White 2004  Vocals / Lyrics
# ℹ 67 more rows
# ℹ Use `print(n = ...)` to see more rows
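If you want to see how the row-by-row extraction works without hitting the site at all, here is a small offline sketch. The HTML below is made up: it only mimics the class names used in the selectors above and is an assumption about the layout, not a copy of the real whosampled.com markup. The `as.integer()` on `year` is an extra step that is not in the code above.

```r
library(tidyverse)
library(rvest)

# Hypothetical markup for illustration only; the real page may be structured differently.
page <- minimal_html('
  <table class="table tdata">
    <tbody>
      <tr>
        <td></td>
        <td><a class="trackName playIcon">Stronger</a></td>
        <td class="tdata__td3">Kanye West</td>
        <td class="tdata__td3">2007</td>
        <td><span class="tdata__badge">Vocals / Lyrics</span></td>
      </tr>
      <tr>
        <td></td>
        <td><a class="trackName playIcon">Boom Boom Pow</a></td>
        <td class="tdata__td3">Black Eyed Peas</td>
        <td class="tdata__td3">2009</td>
        <td><span class="tdata__badge">Vocals / Lyrics</span></td>
      </tr>
    </tbody>
  </table>
')

# Select each table row, then pull one value per column from within that row.
# html_element() (singular) returns the first match inside the row, so every
# row yields exactly one value per field.
rows <- page %>%
  html_elements(".table.tdata tbody tr") %>%
  map_dfr(~ tibble(
    title  = html_element(.x, ".trackName.playIcon") %>% html_text2(),
    artist = html_element(.x, ".tdata__td3") %>% html_text2(),
    # The year cell is the 4th child of its row in this made-up markup
    year   = html_element(.x, ".tdata__td3:nth-child(4)") %>% html_text2() %>% as.integer(),
    genre  = html_element(.x, ".tdata__badge") %>% html_text2()
  ))

print(rows)
```

Working from the row outward like this keeps the columns lined up: if a field is missing in some row, `html_element()` gives `NA` for that row instead of silently shifting the values.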