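I would like to scrape the text of some press statements. The problem I am currently facing is defining several strings at which the scraping should start or end. For example, the starting point (string) should be either "Good afternoon, welcome to our press conference" or "Ladies and gentlemen, welcome to today's report". For the end point of the scrape, I would likewise like to provide several options rather than a single string.
Here is my code so far. It only works with one specific string as the start or end point. Could someone help me and explain how to modify the code so that it works when any one of several strings marks the start/end point? Thanks!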
library(rvest)
library(tidyverse)
library(mclm)
library(tm)
url_list <- c("https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2024/html/ecb.is240411~9974984b58.en.html","https://www.ecb.europa.eu/press/press_conference/monetary-policy-statement/2024/html/ecb.is240307~314650bd5c.en.html")
# Function to extract text between two specific strings
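extract_text <- function(text, start_str, end_str) {
  # Index of the first paragraph that matches the start string
  start_pos <- grep(start_str, text)[1]
  # Indices of all paragraphs that match the end string
  end_pos <- grep(end_str, text)
  # Keep only the first end match at or after the start
  end_pos <- end_pos[end_pos >= start_pos][1]
  if (!is.na(start_pos) && !is.na(end_pos)) {
    text[start_pos:end_pos]
  } else {
    NA # Return NA if start or end string not found
  }
}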
# Initialize list to store extracted statements
ECBstatements <- vector("list", length(url_list))
ECB_dates <- vector("list", length(url_list))
ECB_titles<- vector("list", length(url_list))
# Loop through each URL
for (i in seq_along(url_list)) {
  myLink <- url_list[i]
  Page_ECB <- read_html(myLink)
  # Extract the title
  ECBtitles <- Page_ECB %>%
    html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "ecb-pressContentSubtitle", " " ))]') %>%
    html_text()
  # Extract the date
  ECBdates <- Page_ECB %>%
    html_elements(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "ecb-publicationDate", " " ))]') %>%
    html_text2()
  # Extract the text
  text <- Page_ECB %>%
    html_elements('p') %>%
    html_text2()
  # Extract text between specific strings
  start_str <- "to our press conference" # Specify your start string
  end_str <- "questions." # Specify your end string
  extracted_text <- extract_text(text, start_str, end_str)
  ECBstatements[[i]] <- extracted_text
  ECB_dates[[i]] <- ECBdates
  ECB_titles[[i]] <- ECBtitles
}
# Show the extracted statements
ECBstatements
ECB_dates
ECB_titles
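With the {stringr} package you can use regular expressions to capture the text. Placing the start string before the expression .* and the end string after it captures everything between the two strings. On top of that, the "or" symbol | lets you specify several alternative strings that should count as a match. Note that .* is greedy, so the match runs to the last end string it can reach; writing .*? instead stops at the first one.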
library(dplyr)
library(stringr)
start_strings_0 <- c("to our press conference", "to this press conference", "to the conference", "to our conference")
end_strings_0 <- c("questions", "question", "does anybody have a comment", "does anybody have a question", "the end", "bye")
start_strings <- paste(start_strings_0, collapse = "|")
end_strings <- paste(end_strings_0, collapse = "|")
extract_string <- paste0("(", start_strings, ")", ".*", "(", end_strings, ")")
text <- "Good afternoon welcome to our press conference, today we will talk about why Wombats are the cutest animals. Before we start, does anybody have a question? Because this will be your last chance to speak today."
text2 <- "Hello everybody to the conference. This conference will focus on the huge national issue that are mischievous raccoons roaming our streets. Any questions? None? Ok, we'll begin."
str_extract(text, extract_string)
str_extract(text2, extract_string)
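The same "or" idea can also be carried back to the paragraph vector from the question. The following is only a minimal sketch in base R: extract_text_multi is a hypothetical helper name, the paragraphs are made up to stand in for the output of html_text2(), and it assumes that the first matching start paragraph and the first matching end paragraph after it are the ones you want.

```r
# Hypothetical helper: like extract_text(), but accepts several candidate strings
extract_text_multi <- function(text, start_strs, end_strs) {
  # Join the candidates with "|" so that any one of them counts as a match
  start_pattern <- paste(start_strs, collapse = "|")
  end_pattern <- paste(end_strs, collapse = "|")

  # First paragraph containing any of the start strings
  start_pos <- grep(start_pattern, text, ignore.case = TRUE)[1]
  if (is.na(start_pos)) return(NA_character_)

  # First paragraph at or after the start containing any of the end strings
  end_hits <- grep(end_pattern, text, ignore.case = TRUE)
  end_pos <- end_hits[end_hits >= start_pos][1]
  if (is.na(end_pos)) return(NA_character_)

  text[start_pos:end_pos]
}

# Made-up paragraphs standing in for the scraped <p> elements
paragraphs <- c(
  "Navigation",
  "Good afternoon, welcome to our press conference.",
  "Inflation has continued to fall.",
  "We are now ready to take your questions.",
  "Footer"
)

result <- extract_text_multi(
  paragraphs,
  start_strs = c("to our press conference", "welcome to today's report"),
  end_strs = c("take your questions", "does anybody have a question")
)
print(result)
```

Inside the loop from the question, the call would replace extract_text(text, start_str, end_str). Because the candidates are treated as regular expressions, characters such as . or ? in a candidate string would need to be escaped (for example "questions\\.") if they are meant literally.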