Trouble handling missing information when scraping data with rvest

Problem description

I am currently working on a web-scraping project in R using the rvest package. While the package works well for extracting data from websites, I am having trouble dealing with information that is missing from the page.

Specifically, when some of the elements I target with the html_node or html_text functions are not present on the page, my script breaks when binding the rows into a data frame.
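
For example, here is a minimal, made-up illustration of the failure mode (not my actual page): when a selector matches nothing, html_text() returns a zero-length character vector, and data.frame() then fails because the column lengths no longer line up.

library(rvest)  # rvest re-exports the %>% pipe

page <- minimal_html("<div class='title'>Only a title, no author</div>")

title  <- page %>% html_nodes(".title")  %>% html_text()  # length 1
author <- page %>% html_nodes(".author") %>% html_text()  # length 0

data.frame(title, author)
#> Error in data.frame(title, author) :
#>   arguments imply differing number of rows: 1, 0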

I have tried error-handling techniques such as tryCatch and if statements, but I am struggling to find an effective way to skip the missing elements while still adding the information that does exist to the data frame.

Could someone provide guidance or an example of how to handle missing information gracefully when scraping with rvest? Are there any particular functions or strategies I should use to achieve this?

Thanks in advance for your help!

Website: https://expresso.pt/api/molecule/search?q=trotinete&page=**PAGENUMBER**&offset=0 Language: R Output: CSV

library(rvest)
library(dplyr)

expresso_scraper <- data.frame()

get_article <- function(link_article) {
  tryCatch({
    article_page <- read_html(link_article)
    print(link_article)
    
    article_text <- article_page %>% html_nodes("#article-body-1 span") %>% html_text() %>% paste()
    Sys.sleep(2)  # Pause for 2 seconds
    return(article_text)
  }, error = function(e) {
    message("Error occurred while scraping article: ", link_article)
    return(NA)
  })
  
}

search_term <- "trotinete"

#set_cookies()

for(page_result in seq(from = 1, to = 2)) {
  link <- paste0("https://expresso.pt/api/molecule/search?q=", search_term, "&page=", page_result, "&offset=0")     
  print(link)
  Sys.sleep(2)  # Pause for 2 seconds
  page <- read_html(link)
  
  if (grepl("href", page)) {
    title <- page %>% html_nodes(".title a") %>% html_text()
    link_article <- page %>% html_nodes(".title a") %>% html_attr("href") %>%
      ifelse(!grepl("\\.pt", .), paste("https://expresso.pt", ., sep = ""), .)
    date <- page %>% html_nodes(".timeStamp") %>%  html_attr("datetime") 
    section <- page %>% html_nodes(".mainSection") %>% html_text() %>% ifelse(length(.) == 0, NA, .)
    summary <- page %>% html_nodes(".lead") %>% html_text() %>% ifelse(length(.) == 0, NA, .)
    author <- page %>% html_nodes(".author") %>% html_text() %>% ifelse(length(.) == 0, NA, .)
    complete_article <- sapply(link_article, FUN = get_article, USE.NAMES = FALSE) 
    complete_article <- sapply(complete_article, paste, collapse = "\n") # Flattening the list
    expresso_scraper <- rbind(expresso_scraper, data.frame(title, link_article, date, section, author, summary, complete_article, stringsAsFactors = FALSE)) 
  }
}

write.csv2(expresso_scraper, "expresso.csv")
Tags: r, csv, web-scraping, dplyr, rvest
1 Answer

Personally, I would just loop over the list elements and parse them individually, i.e. instead of building a vector for each frame column, collect the items for each frame row into a named list. That way there is no need to check for missing elements or to worry about matching vector lengths: if rvest cannot find the requested element inside that particular <li> element, the value for that row is simply set to NA.
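
The mechanism this relies on is that html_element() (singular) returns a missing node when the selector matches nothing, and html_text() / html_attr() on that missing node return NA rather than a zero-length vector. A tiny, made-up illustration:

library(rvest)

li <- minimal_html("<li><span class='title'>Some headline</span></li>")

html_element(li, ".title")  |> html_text(trim = TRUE)
#> [1] "Some headline"

html_element(li, ".author") |> html_text(trim = TRUE)  # no such element in this <li>
#> [1] NA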

Some of the links are behind a paywall, so I am not quite sure what is going on there, but I would assume that retrieving the complete article needs a different approach depending on the site, which is why get_article() first decides which parser to call. The slow_*() functions are rate-limited; purrr::slowly() makes it very convenient to rate-limit requests in pipes and mutate() calls.

library(rvest)
library(dplyr)
library(purrr)
library(stringr)

# request and parse search results
search_results <- function(search_term, page){
  read_html(str_glue("https://expresso.pt/api/molecule/search?q={search_term}&page={page}&offset=0")) |>
    html_elements("ul.listArticles > li") |>
    map(\(li) list(
      title        = html_element(li, ".title a") |> html_text(trim = TRUE),
      link_article = html_element(li, ".title a") |> html_attr("href"),
      date         = html_element(li, ".timeStamp") |> html_attr("datetime"),
      section      = html_element(li, ".mainSection") |> html_text(trim = TRUE),
      summary      = html_element(li, ".lead") |> html_text(trim = TRUE),
      author       = html_element(li, ".author") |> html_text(trim = TRUE)
    ))
}

# rate-limited search_results
slow_search <- slowly(search_results, rate = rate_delay(2))

# site-specific article parser
get_expresso <- function(link_article){
  read_html(link_article) |>
    html_elements("#article-body-1 > div > *") |>
    html_text(trim = TRUE) |> 
    str_c(collapse = "\n")
}

# rate-limited article parser
slow_expresso <- slowly(get_expresso, rate = rate_delay(2))

# use url to select & call appropriate site-specific parser
get_article <- function(link_article) {
  if (str_starts(link_article, "https://expresso.pt/")){
    return(slow_expresso(link_article))
  }else if(str_starts(link_article, "something else")){
    return("call some other site-specific parser")
  }else{
    return(NA)
  }
}

The main loop:

search_term <- "trotinete"
pages <- 1:2

expresso_scraper <- 
  pages |>
  # 1st map() returns list of named list
  map(\(page) slow_search(search_term, page), .progress = TRUE) |>
  # 2nd map() returns a list of data.frames / tibbles, one per page (2 here)
  map(bind_rows) |>
  # combine tibbles and also store origin page number
  list_rbind(names_to = "page") |>
  # instead of ifelse, let's just use str_replace on relative URLs that start with `/`
  mutate(link_article = str_replace(link_article, "^/",  "https://expresso.pt/")) |>
  # limit the number of articles to 5 in this example
  head(5) |>
  # get complete articles
  mutate(complete_article = map_chr(link_article, get_article, .progress = TRUE))

  
expresso_scraper
#> # A tibble: 5 × 8
#>    page title         link_article date  section summary author complete_article
#>   <int> <chr>         <chr>        <chr> <chr>   <chr>   <chr>  <chr>           
#> 1     1 Líderes na t… https://exp… 2023… Lídere… Mudanç… Franc… "1. Fiscalidade…
#> 2     1 Líderes na t… https://lei… 2023… Jornal… Mudanç… <NA>    <NA>           
#> 3     1 Neste hotel … https://exp… 2023… 20 Ano… Respir… BCBM   "Abriu em 2017 …
#> 4     1 “Cinto-me vi… https://exp… 2023… Socied… De aco… Lusa   "A campanha de …
#> 5     1 Condicioname… https://exp… 2023… Socied… As res… Lusa   "Os condicionam…

Created on 2024-04-24 with reprex v2.1.0
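
To end up with the CSV file asked for in the question, the combined tibble can then be written out just as in the original script (write.csv2() uses a semicolon separator and decimal comma; use write.csv() for a plain comma-separated file):

write.csv2(expresso_scraper, "expresso.csv", row.names = FALSE)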
