Extracting text via web scraping: looping with multiple alternative start/end strings

Question (votes: 0, answers: 1)

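I want to scrape the text of some press statements. The problem I am currently facing is defining several strings at which the scraping should start or end. For example, the starting point (string) should be either "Good afternoon, welcome to our press conference" or "Ladies and gentlemen, welcome to today's report". For the end point of the scrape I would likewise like to supply several options rather than a single string.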

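This is my code so far. It only works with one specific string as the start point and one as the end point. Can someone help me and explain how to modify the code so that it accepts more than one candidate string for the start and end points? Thanks!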

library(rvest)
library(tidyverse)
library(mclm)
library(tm)
url_list <- c("https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2024/html/ecb.is240411~9974984b58.en.html","https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2024/html/ecb.is240307~314650bd5c.en.html")
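# Function to extract the paragraphs between two specific strings
extract_text <- function(text, start_str, end_str) {
  start_pos <- grep(start_str, text)
  end_pos <- grep(end_str, text)
  # Keep only end matches at or after the first start match
  end_pos <- end_pos[end_pos >= start_pos[1]]
  if (length(start_pos) > 0 && length(end_pos) > 0) {
    # Use the first match of each, since `:` needs single positions
    text[start_pos[1]:end_pos[1]]
  } else {
    NA  # Return NA if start or end string not found
  }
}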

# Initialize list to store extracted statements
ECBstatements <- vector("list", length(url_list))
ECB_dates <- vector("list", length(url_list))
ECB_titles<- vector("list", length(url_list))
# Loop through each URL
for (i in seq_along(url_list)) {
  myLink <- url_list[i]
  Page_ECB <- read_html(myLink)
  #Extract the title
  ECBtitles<-Page_ECB %>%html_elements(xpath='//*[contains(concat( " ", @class, " " ), concat( " ", "ecb-pressContentSubtitle", " " ))]') %>% 
    html_text()
  #Extract the date
  ECBdates<-Page_ECB %>%html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "ecb-publicationDate", " " ))]') %>%
    html_text2()                               
  
  # Extract the text
  text <- Page_ECB %>%
    html_nodes('p') %>%
    html_text2()
  
  # Extract text between specific strings
  start_str <- "to our press conference" # Specify your start string 
  end_str <- "questions."# Specify your end string 
  extracted_text <- extract_text(text, start_str, end_str)
  
  ECBstatements[[i]] <- extracted_text
  ECB_dates[[i]]<-ECBdates
  ECB_titles[[i]]<-ECBtitles

 
}


# Show the extracted statements
ECBstatements
ECB_dates
ECB_titles
Tags: r, loops, web-scraping, text-mining
1 Answer

Votes: 0

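With the {stringr} package you can capture the text with a regular expression. Put the start string in front of the expression .* (which matches any run of characters) and the end string after it, so that the pattern captures everything between the two. On top of that, you can use the "or" symbol | to specify several alternative strings that should match.

Note that .* is greedy: if an end string occurs more than once, the match runs to its last occurrence. Use .*? instead to stop at the first one. Also, str_extract() works within a single string, so a character vector of paragraphs has to be collapsed into one string first, or matched paragraph by paragraph.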
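As a small illustration of these two building blocks before the full example, here is a sketch in base R. The sentence and the candidate strings are made up for this illustration; it only shows how the alternation is built and how a greedy and a lazy quantifier differ.

```r
# Toy sentence, invented for illustration only
x <- "Welcome to our press conference. Any question? One more question? Bye."

starts <- c("to our press conference", "to the conference")
ends   <- c("question", "bye")

# Join the alternatives with "|" and wrap each set in a group
start_alt <- paste0("(", paste(starts, collapse = "|"), ")")
end_alt   <- paste0("(", paste(ends, collapse = "|"), ")")

greedy <- paste0(start_alt, ".*",  end_alt)  # runs to the last end string
lazy   <- paste0(start_alt, ".*?", end_alt)  # stops at the first end string

# perl = TRUE selects PCRE, where ".*?" is a lazy quantifier
greedy_match <- regmatches(x, regexpr(greedy, x, perl = TRUE))
lazy_match   <- regmatches(x, regexpr(lazy,   x, perl = TRUE))

print(greedy_match)
print(lazy_match)
```

Matching is case-sensitive here, so the capitalised "Bye" is not treated as an end string.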

library(dplyr)
library(stringr)

start_strings_0 <- c("to our press conference", "to this press conference", "to the conference", "to our conference")
end_strings_0 <- c("questions", "question", "does anybody have a comment", "does anybody have a question", "the end", "bye")

start_strings <- paste(start_strings_0, collapse = "|")
end_strings <- paste(end_strings_0, collapse = "|")

extract_string <- paste0("(", start_strings, ")", ".*", "(", end_strings, ")")

text <- "Good afternoon welcome to our press conference, today we will talk about why Wombats are the cutest animals. Before we start, does anybody have a question? Because this will be your last chance to speak today."

text2 <- "Hello everybody to the conference. This conference will focus on the huge national issue that are mischievous raccoons roaming our streets. Any questions? None? Ok, we'll begin."

str_extract(text, extract_string)
str_extract(text2, extract_string)
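In the question, the scraped text is a character vector with one element per paragraph, not a single string, so a paragraph-level variant may fit better. The following is a minimal sketch under that assumption, using only base R. The helper names matches_any and extract_between and the sample paragraphs are hypothetical and not part of the original code.

```r
# TRUE for every paragraph that contains at least one candidate string.
# fixed = TRUE treats candidates as literal text, so "questions." needs no escaping.
matches_any <- function(paragraphs, candidates) {
  hit <- rep(FALSE, length(paragraphs))
  for (s in candidates) {
    hit <- hit | grepl(s, paragraphs, fixed = TRUE)
  }
  hit
}

# Return the paragraphs from the first start match up to the first
# end match at or after it, or NA if either is missing.
extract_between <- function(paragraphs, start_strings, end_strings) {
  start_pos <- which(matches_any(paragraphs, start_strings))
  if (length(start_pos) == 0) return(NA_character_)
  first <- start_pos[1]
  end_pos <- which(matches_any(paragraphs, end_strings))
  end_pos <- end_pos[end_pos >= first]
  if (length(end_pos) == 0) return(NA_character_)
  paragraphs[first:end_pos[1]]
}

# Invented sample paragraphs standing in for html_text2() output
paragraphs <- c(
  "Jump to the transcript",
  "Good afternoon, welcome to our press conference.",
  "Inflation has continued to fall.",
  "We are now ready to take your questions.",
  "Question: what about rates?"
)

result <- extract_between(
  paragraphs,
  start_strings = c("to our press conference", "to today's report"),
  end_strings   = c("questions.", "does anybody have a question")
)
print(result)
```

Inside the loop from the question, the call extract_text(text, start_str, end_str) could then be swapped for extract_between(text, start_strings, end_strings), with both arguments given as character vectors.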