How to collect the <p> elements under a specific <h2> element

Question (votes: 0, answers: 1)

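Suppose I'm trying to scrape a transcript like this one. If you scroll down, you'll see an h2 element that has both the text "Transcript" and an id='transcript' attribute. If I'm not mistaken, the p elements that appear "under" that h2 heading are actually its siblings, not its children, which is why neither of the following two approaches works: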

# using rvest

t %>% 
  html_elements('#transcript') %>% 
  html_children()

t %>% 
  html_elements('#transcript p')

So how do I get those p elements?

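I searched for earlier SO answers and found only (somewhat) similar questions from BeautifulSoup users. Still, this seems like it should be a basic problem, so maybe I'm further off base than I think.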

html r web-scraping rvest
1 answer
0
votes

Does this work for you? See the comments in the code for an explanation.

library(rvest)
library(xml2)

#read the page
url <- "https://80000hours.org/podcast/episodes/kevin-esvelt-stealth-wildfire-pandemics/"
page <- read_html(url)

#find the h2 elements
h2_elements <- page %>% html_elements('h2')
h2_text <- h2_elements %>% html_text()

#select the h2 node whose text contains the word "Transcript"
desired_h2 <- h2_elements[grep("Transcript", h2_text)]

#find the parent node of the desired h2
parent <- xml_parent(desired_h2)

#find all "p" nodes under the parent (every p descendant of the parent, not only those after the h2)
answer <- parent %>% html_elements("p") %>% html_text()

head(answer, 5)

[1] "Table of Contents"                                                                                                                                                                                                                                                                                                                                                            
[2] "Kevin Esvelt: So scientists correctly appreciate that, when there is controversy, you can get a paper in Nature, Science, or Cell — the top journals which are the best for your career."                                                                                                                                                                                     
[3] "Therefore, the incentives favour scientists identifying pandemic-capable viruses and determining whether posited cataclysmically destructive viruses and other forms of attack would actually function."                                                                                                                                                                      
[4] "And I have not seen any appreciable counter-incentives that could be anywhere near as powerful as the ones favouring our desire to know. Because almost all the time, it is better for us to know."                                                                                                                                                                           
[5] "So I don’t see many plausible futures in which we do not learn how to build agents that would bring down civilisation today. We just know that in the limit, if you get good enough at programming biology, we can do anything t
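
Note that the first result above is "Table of Contents", since selecting every p under the parent is broader than selecting the siblings that follow the heading. If only those siblings are wanted, one possible alternative is a sibling selector. The sketch below is not run against the real page: it uses a small made-up HTML snippet (hypothetical, built with minimal_html()) that assumes the p elements are siblings of the h2, as the question describes. The variable names are illustrative only.

```r
library(rvest)

# Hypothetical stand-in for the real page: the <p> elements are
# siblings of the <h2 id="transcript">, not its children.
page <- minimal_html('
<div>
  <p>Table of Contents</p>
  <h2 id="transcript">Transcript</h2>
  <p>First line.</p>
  <p>Second line.</p>
  <h2 id="next">Next section</h2>
  <p>Not part of the transcript.</p>
</div>')

# XPath: every <p> sibling that comes after the h2
after_h2 <- page %>%
  html_elements(xpath = "//h2[@id='transcript']/following-sibling::p") %>%
  html_text2()

# CSS equivalent, using the general sibling combinator (~)
after_h2_css <- page %>%
  html_elements("#transcript ~ p") %>%
  html_text2()

# XPath: only the <p> siblings whose nearest preceding <h2> is the
# transcript heading, so the selection stops at the next h2
transcript_only <- page %>%
  html_elements(xpath = paste0(
    "//h2[@id='transcript']/following-sibling::p",
    "[preceding-sibling::h2[1][@id='transcript']]"
  )) %>%
  html_text2()

print(after_h2)
print(transcript_only)
```

On this snippet, the first two selections skip "Table of Contents" but still pick up the paragraph after the next h2, while the last one keeps only the two transcript paragraphs. Whether the stricter version is needed on the real page depends on whether other h2 sections follow the transcript under the same parent.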