Ignore non-existent URLs and continue scraping

Question

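I am new to web scraping and the rvest package. What I want to do is scrape the news content from the following website: http://www.xwlbo.com/31035.html. I noticed that the historical news pages follow a numeric index pattern, but I later found that the indices are random with no clear rule, so some pages may not exist, and I get the error: Error in open.connection(x, "rb") : HTTP error 404. How can I skip the missing pages and carry on with the pages that do exist?

This is what I have so far.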

library(tidyverse)
library(lubridate)
library(stringr)
library(rvest)
Sys.setlocale(category="LC_ALL",locale="chinese")

web_index_number <- 4058:31106

urls <- str_c("http://www.xwlbo.com/",web_index_number,".html")



news_collect <- function(x){
  webpage <- read_html(x)
  wp_title <- html_node(webpage,'h2') %>% 
    html_text()
  wp_content <- html_nodes(webpage,'p , a , h2') %>% 
    html_text()
  len <- length(wp_content)-3
  wp_content <- wp_content[1:len]
  wp_title <- rep(wp_title,len)
  news <- data.frame(wp_title,wp_content)
}

news_collection <- map_df(urls,news_collect)

Any help would be greatly appreciated.

Thanks, Felix

Answer 1

You can use a tryCatch construct around the code that feeds news_collection. If read_html(x) fails, the error handler can print the error and return NULL.
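As a minimal sketch of that idea, the wrapper below calls a scraping function on one URL and turns any error into a printed message plus a NULL result. The name safe_collect and its collector argument are hypothetical, introduced here only for illustration; news_collect and urls are the objects from the question.

```r
# Hypothetical helper (not from the original answer): run `collector` on a
# single URL and return NULL instead of stopping when an error occurs,
# for example the HTTP 404 raised by read_html() for a missing page.
safe_collect <- function(url, collector) {
  tryCatch(
    collector(url),
    error = function(e) {
      # Report which URL failed and why, then signal "no data" with NULL.
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Assumed usage with the objects defined in the question:
# news_collection <- map_df(urls, safe_collect, collector = news_collect)
```

Because map_df combines the results by binding rows, the NULL entries returned for missing pages should simply be dropped, so only the pages that exist end up in news_collection.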
