R: Scraping dynamic links with rvest

Question

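I am trying to use rvest to get the links to an RSS feed from the Internet Archive, where they sit under a "dynamic" calendar; see this link for an example.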

<div>
<div class="captures">
<div class="position" style="width: 20px; height: 20px;">
<div class="measure ">
</div>
</div>
<a href="/web/20100112114601/http://www.dailyecho.co.uk/news/district/winchester/rss/">12</a>
</div>
<!-- react-empty: 2310 --></div>

For example,

url %>% 
  read_html() %>%
  html_nodes("a") %>% 
  html_attr("href")

does not return the links I am interested in, and xpath or html_nodes('.captures') returns empty results. Any hints would be very helpful, thanks!

Answer 1

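One possibility is to use the wayback package (GL) (GH), which supports querying the Internet Archive and reading the HTML of saved pages ("mementos"). You can read up on web archiving terminology (it is a bit arcane, IMO) via http://www.mementoweb.org/guide/quick-intro/ and https://mementoweb.org/guide/rfc/ as starter resources.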

library(wayback) # devtools::install_git(one of the superscript'ed links above)
library(rvest) # for reading the resulting HTML contents
library(tibble) # mostly for prettier printing of data frames

There are many approaches one can take. This is what I tend to do during forensic analysis of online content. YMMV.

First, we get the recorded mementos (basically a short list of relevant content):

(rss <- get_mementos("http://www.dailyecho.co.uk/news/district/winchester/rss/"))
## # A tibble: 7 x 3
##   link                                                             rel       ts                 
##   <chr>                                                            <chr>     <dttm>             
## 1 http://www.dailyecho.co.uk/news/district/winchester/rss/         original  NA                 
## 2 http://web.archive.org/web/timemap/link/http://www.dailyecho.co… timemap   NA                 
## 3 http://web.archive.org/web/http://www.dailyecho.co.uk/news/dist… timegate  NA                 
## 4 http://web.archive.org/web/20090517035444/http://www.dailyecho.… first me… 2009-05-17 03:54:44
## 5 http://web.archive.org/web/20180712045741/http://www.dailyecho.… prev mem… 2018-07-12 04:57:41
## 6 http://web.archive.org/web/20180812213013/http://www.dailyecho.… memento   2018-08-12 21:30:13
## 7 http://web.archive.org/web/20180812213013/http://www.dailyecho.… last mem… 2018-08-12 21:30:13

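The IA's calendar menu viewer is really the "timemap". I like working with it because it is the point-in-time memento list of all the crawls. It is the second link above, so we'll read that in: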

(tm <- get_timemap(rss$link[2]))
## # A tibble: 45 x 5
##    rel           link                                  type        from          datetime       
##    <chr>         <chr>                                 <chr>       <chr>         <chr>          
##  1 original      http://www.dailyecho.co.uk:80/news/d… NA          NA            NA             
##  2 self          http://web.archive.org/web/timemap/l… applicatio… Sun, 17 May … NA             
##  3 timegate      http://web.archive.org                NA          NA            NA             
##  4 first memento http://web.archive.org/web/200905170… NA          NA            Sun, 17 May 20…
##  5 memento       http://web.archive.org/web/200908130… NA          NA            Thu, 13 Aug 20…
##  6 memento       http://web.archive.org/web/200911121… NA          NA            Thu, 12 Nov 20…
##  7 memento       http://web.archive.org/web/201001121… NA          NA            Tue, 12 Jan 20…
##  8 memento       http://web.archive.org/web/201007121… NA          NA            Mon, 12 Jul 20…
##  9 memento       http://web.archive.org/web/201011271… NA          NA            Sat, 27 Nov 20…
## 10 memento       http://web.archive.org/web/201106290… NA          NA            Wed, 29 Jun 20…
## # ... with 35 more rows
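
If it helps, the memento rows can be pulled out of the timemap and given proper timestamps. This is a minimal sketch, assuming the `rel` and `link` columns shown above and the usual `/web/<14-digit timestamp>/` form of Wayback URLs. `tm_demo` is a made-up stand-in for `tm` (the example.com rows are invented), so nothing here hits the network:

```r
# Illustrative stand-in for the timemap tibble; example.com rows are invented.
tm_demo <- data.frame(
  rel = c("original", "first memento", "memento", "memento"),
  link = c(
    "http://example.com/rss/",
    "http://web.archive.org/web/20090517035444/http://example.com/rss/",
    "http://web.archive.org/web/20100112114601/http://www.dailyecho.co.uk/news/district/winchester/rss/",
    "http://web.archive.org/web/20100712000000/http://example.com/rss/"
  ),
  stringsAsFactors = FALSE
)

# Keep only rows whose relation mentions "memento" (first/last/plain).
is_memento <- grepl("memento", tm_demo$rel, fixed = TRUE)
mementos <- tm_demo[is_memento, , drop = FALSE]

# Pull the 14-digit capture timestamp out of each archive URL.
stamp <- sub("^.*/web/([0-9]{14})/.*$", "\\1", mementos$link)
mementos$captured <- as.POSIXct(stamp, format = "%Y%m%d%H%M%S", tz = "UTC")

mementos[, c("rel", "captured")]
```

Taking the timestamp from the URL rather than from the `datetime` column sidesteps locale-dependent parsing of day and month names.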

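The content is in the mementos, and there should be as many mementos as there are entries in the calendar view. We'll read in the first one: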

mem <- read_memento(tm$link[4]) # row 4 is the "first memento" in the timemap above
# Ideally use writeLines(), now, to save this to disk with a good
# filename. Alternatively, stick it in a data frame with metadata 
# and saveRDS() it. But, that's not a format others (outside R) can 
# use so perhaps do the data frame thing and stream it out as ndjson
# with jsonlite::stream_out() and compress it during save or afterwards.
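
The save-to-disk idea in the comments could look roughly like the sketch below. `cache_memento()` and its `fetch` argument are hypothetical helpers, not part of wayback. The sketch assumes the fetcher returns the page content as character data, which is worth checking against what read_memento() actually gives you:

```r
# Hypothetical helper: fetch a memento once, then reuse the copy on disk.
# `fetch` is any function that takes a URL and returns character content.
cache_memento <- function(url, cache_dir, fetch) {
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  # Build a filesystem-safe file name from the URL.
  fname <- paste0(gsub("[^A-Za-z0-9]+", "_", url), ".txt")
  path <- file.path(cache_dir, fname)

  # Only go to the network when there is no local copy yet.
  if (!file.exists(path)) {
    writeLines(fetch(url), path)
  }

  paste(readLines(path, warn = FALSE), collapse = "\n")
}
```

In practice the `fetch` argument would presumably be read_memento or a thin wrapper around it, so a second call for the same URL never touches the IA servers.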

Then turn it into something we can work with programmatically, using xml2::read_xml() or xml2::read_html() (RSS is sometimes better parsed as XML):

read_html(mem)
## {xml_document}
## <html>
## [1] <body><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Daily Ec ...
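
To illustrate the XML route, here is a small sketch that parses an RSS string with xml2 and collects the item titles and links into a data frame. The feed text is invented for the example; a real memento would take the place of `rss_txt`, and feeds that declare a default namespace would need namespace-aware XPath instead:

```r
library(xml2)

# Invented RSS content standing in for a saved memento.
rss_txt <- '<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>'

doc <- read_xml(rss_txt)

# One node per <item>; then take the first <title>/<link> child of each.
items <- xml_find_all(doc, "//item")
feed <- data.frame(
  title = xml_text(xml_find_first(items, "./title")),
  link = xml_text(xml_find_first(items, "./link")),
  stringsAsFactors = FALSE
)

feed
```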

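read_memento() has an as parameter to parse the result automatically, but I like to store the mementos locally (as noted in the comments) so as not to abuse the IA servers (i.e. if I ever need to get the data again, I don't have to hit their infrastructure).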

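A big caveat is that if you try to get too many resources from the IA in a short period of time, you will be temporarily banned. They operate at scale, but it is a free service and they (rightfully so) try to prevent abuse.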
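
One way to stay on the safe side is to pause between requests. A rough sketch follows; `fetch_politely()` is a hypothetical helper, and the 5-second default delay is an arbitrary guess rather than a documented IA limit:

```r
# Hypothetical helper: call `fetch` on each URL with a pause in between.
# Failed fetches are recorded as NA instead of aborting the whole run.
fetch_politely <- function(urls, fetch, delay = 5) {
  out <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay) # wait before every request after the first
    out[[i]] <- tryCatch(fetch(urls[i]), error = function(e) NA_character_)
  }
  names(out) <- urls
  out
}
```

Here `fetch` would again presumably be read_memento (or a caching wrapper around it), applied to the memento links from the timemap.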

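Definitely file issues on the package if anything is unclear or you feel something could be done better (pick your favorite source code hosting community to do so, as I'll work with either, though I prefer GitLab after the Microsoft takeover of GitHub). It is not a popular package and I only need it occasionally for forensic spelunking, so it "works for me", but I am glad to try to make it more user-friendly (I just need to know the pain points).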
