Extracting text from a web scrape


I am trying to get text from a website. My code works (sort of):

library(rvest)
library(stringr)
library(lubridate)

# urls_meetings, no_urls and statements_list are defined earlier
for (i in seq_len(no_urls)) {
  this_url <- urls_meetings[[i]]
  page <- read_html(this_url)

  text <- page |> html_elements("body") |> html_text2()

  # Pull a date such as "January 29, 2003" out of the page text
  date_str <- str_extract(text[1], "\\b\\w+ \\d{1,2}, \\d{4}\\b")

  # Convert to a Date object, then to a compact "YYYYMMDD" string
  date <- mdy(date_str)
  date_1 <- format(date, "%Y%m%d")

  # Store the statement text under its date
  statements_list[[i]] <- text[2]
  names(statements_list)[i] <- date_1
}

The problem is that the output of the line

text=page |> html_elements("body") |> html_text2()

gives me the entire text of the page:

[1] "\r \r\r \r\nRelease Date: January 29, 2003\r\n\n\n\n\n\r For immediate release\r\n\n\r\n\n\r\r\n\n\r The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future. \r\n\n\r Voting for the FOMC monetary policy action were Alan Greenspan, Chairman; William J. McDonough, Vice Chairman; Ben S. Bernanke, Susan S. Bies; J. Alfred Broaddus, Jr.; Roger W. Ferguson, Jr.; Edward M. Gramlich; Jack Guynn; Donald L. Kohn; Michael H. Moskow; Mark W. Olson, and Robert T. Parry. 
\r \r \r\n\n\r -----------------------------------------------------------------------------------------\r DO NOT REMOVE: Wireless Generation\r ------------------------------------------------------------------------------------------\r 2003 Monetary policy \r\n\nHome | News and \r events\nAccessibility\r\n\r Last update: January 29, 2003\r\r \r\n(function(){if (!document.body) return;var js = \"window['__CF$cv$params']={r:'8775c6b49a2a2015',t:'MTcxMzYyMjgzOC41MjIwMDA='};_cpo=document.createElement('script');_cpo.nonce='',_cpo.src='/cdn-cgi/challenge-platform/scripts/jsd/main.js',document.getElementsByTagName('head')[0].appendChild(_cpo);\";var _0xh = document.createElement('iframe');_0xh.height = 1;_0xh.width = 1;_0xh.style.position = 'absolute';_0xh.style.top = 0;_0xh.style.left = 0;_0xh.style.border = 'none';_0xh.style.visibility = 'hidden';document.body.appendChild(_0xh);function handler() {var _0xi = _0xh.contentDocument || _0xh.contentWindow.document;if (_0xi) {var _0xj = _0xi.createElement('script');_0xj.innerHTML = js;_0xi.getElementsByTagName('head')[0].appendChild(_0xj);}}if (document.readyState !== 'loading') {handler();} else if (window.addEventListener) {document.addEventListener('DOMContentLoaded', handler);} else {var prev = document.onreadystatechange || function () {};document.onreadystatechange = function (e) {prev(e);if (document.readyState !== 'loading') {document.onreadystatechange = prev;handler();}};}})();"


I only need to keep the relevant text. I have tried various things, such as

str_extract(text, "(?<=The Federal Open Market)(.*?)(?=Voting)")


 str_match(text, "The Federal Open Market(.*?)Voting")

but both of them give me an empty result in return.

The ideal output is

The Federal Open Market Committee decided today to keep its target for the federal funds rate unchanged at 1-1/4 percent. \r\n\n\r Oil price premiums and other aspects of geopolitical risks have reportedly fostered continued restraint on spending and hiring by businesses. However, the Committee believes that as those risks lift, as most analysts expect, the accommodative stance of monetary policy, coupled with ongoing growth in productivity, will provide support to an improving economic climate over time.\r\n\n\r In these circumstances, the Committee believes that, against the background of its long-run goals of price stability and sustainable economic growth and of the information currently available, the risks are balanced with respect to the prospects for both goals for the foreseeable future.

html r web-scraping rvest
1 Answer

The reason your patterns do not work is that your string contains newlines. The . metacharacter is defined to match any character except a newline.

Here is a shorter example:

txt <- "there are some\r\nwords here"
str_extract(txt, "some.+words")
# [1] NA

One way to deal with this using stringr::str_extract() is to set dotall = TRUE, which allows . to match everything, including \n:

str_extract(txt, regex("some.+words", dotall = TRUE))
# [1] "some\r\nwords"
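
If you would rather keep the pattern as a plain string, the same switch can probably be turned on inside the pattern itself. This is a minimal sketch, assuming the inline flag (?s) supported by the ICU regex engine that stringr uses; it reuses the txt string from the short example above:

```r
library(stringr)

txt <- "there are some\r\nwords here"

# (?s) is the inline form of the dotall option: from this point on,
# . is allowed to match line terminators as well
inline_match <- str_extract(txt, "(?s)some.+words")
inline_match
```

Which form to use is mostly a matter of taste: regex(..., dotall = TRUE) is more explicit, while (?s) travels with the pattern if you store it in a variable or a config file.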

Or, for your string:

str_extract(text, regex("(?<=The Federal Open Market)(.*?)(?=Voting)", dotall = TRUE))
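
Note that the lookbehind leaves "The Federal Open Market" itself out of the match, while your ideal output starts with that phrase. Below is a rough sketch of a variant that keeps the opening phrase and also tidies the leftover \r and \n runs with str_squish(). The text value here is a shortened, made-up stand-in for your scraped page, and the names pattern, statement and statement_clean are just illustrative:

```r
library(stringr)

# Shortened, hypothetical stand-in for the scraped page text
text <- paste0(
  "Release Date: January 29, 2003\r\n\n\r For immediate release\r\n\n\r ",
  "The Federal Open Market Committee decided today to keep its target unchanged. ",
  "\r\n\n\r The risks are balanced. \r\n\n\r ",
  "Voting for the FOMC monetary policy action were Alan Greenspan, Chairman."
)

# Match from the opening phrase (included) up to, but not including, "Voting".
# dotall = TRUE lets . cross the line breaks inside the statement.
pattern <- regex("The Federal Open Market.*?(?=Voting)", dotall = TRUE)
statement <- str_extract(text, pattern)

# Collapse every run of whitespace (including \r and \n) into a single space
# and trim both ends
statement_clean <- str_squish(statement)
statement_clean
```

If you want to keep the paragraph breaks, skip str_squish() and use str_trim() on the match instead.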