Scraping the content between two h5 headings using XPath in rvest?

Question

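I am using rvest to extract content from local HTML files. I want to extract a specific section of content between two h5 headings, and the only fixed detail is the text of the opening h5 heading. The problem is that the headings differ between documents: both the IDs and the text content can vary a lot. The only exception is the text of the heading I am interested in, "Details". Here is an example of the document structure: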

<div id="document">
<h3>Title of the document</h3>
<h4 id="id11111">Focus of the document</h4>
<p>This document focuses on…</p>
<p>And also…</p>
<h5 id="id22222"> 1. Introduction </h5>
<p>Text here.</p>
<h6 id="33333"> 1.1 Preliminary introduction </h6>
<p> Text here. </p>
<h5 id="id44444"> 2. Details </h5>
<p>Text here.</p>
<h6 id="id55555"> 2.1 Details about A </h6>
<p> Text here. </p>
<h6 id="id66666"> 2.2 Details about B </h6>
<p> Text here. </p>
<h5 id="id77777"> 3. Timeline </h5>
<p>Text here.</p>
<h6 id="id88888"> 3.1 Timeline A </h6>
<p>Text here.</p>
</div>

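From the example above, I only want to extract the content that starts at the h5 tag with id id44444 and heading text "2. Details" and runs up to the next h5 heading (h5 id77777, "3. Timeline").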

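I have managed to start the scrape at the desired h5 tag by using contains and following-sibling::* (see the example below), but it returns every sibling up to the end of the document, whereas my goal is to stop at the following h5 heading.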

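I have not figured out how to use preceding-sibling, because the following h5 tag has no standard id, XPath, or text content, and the order of the headings is not standard either. The h5 headings may appear in a different order.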

# Load rvest
library(rvest)

# full.names = TRUE returns complete paths, so read_html() can find each file
files <- list.files("C:/htmldocuments", full.names = TRUE)

# Perform the scrape
scraping <- sapply(files, function(x)
  read_html(x, encoding = "utf-8") %>%
    html_nodes(xpath = '//h5[contains(., "Details")]/following-sibling::*') %>%
    html_text())

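This returns results that start at the right place, but how do I make it stop at the next h5 tag after the "Details" h5 tag? The ID and title of that following h5 tag vary, so they are unknown.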
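The behavior described above can be reproduced without any files on disk. The following is a minimal sketch, assuming an abridged inline copy of the sample document (the h4 and its two paragraphs are left out, and the mismatched closing tag is corrected); the variable names `html`, `doc`, and `siblings` are illustrative only.

```r
library(rvest)

# Abridged inline copy of the sample structure (an assumption for illustration;
# the real documents are read from local files).
html <- '<div id="document">
<h3>Title of the document</h3>
<h5 id="id22222"> 1. Introduction </h5>
<p>Text here.</p>
<h6 id="33333"> 1.1 Preliminary introduction </h6>
<p> Text here. </p>
<h5 id="id44444"> 2. Details </h5>
<p>Text here.</p>
<h6 id="id55555"> 2.1 Details about A </h6>
<p> Text here. </p>
<h6 id="id66666"> 2.2 Details about B </h6>
<p> Text here. </p>
<h5 id="id77777"> 3. Timeline </h5>
<p>Text here.</p>
<h6 id="id88888"> 3.1 Timeline A </h6>
<p>Text here.</p>
</div>'

doc <- read_html(html)

# Every sibling after the "Details" h5, with no upper bound.
siblings <- html_nodes(
  doc,
  xpath = '//h5[contains(., "Details")]/following-sibling::*'
)

# The tag names show that the selection runs past the next h5.
html_name(siblings)
```

With this input the tag names include the "3. Timeline" h5 and everything after it, which is the overshoot the question is about.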

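I have reviewed several similar questions, and the answers usually point to using preceding-sibling, but I cannot work out how to apply it when I have no way of knowing what the following h5 will be.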

Answer 1

You can use the following XPath expression:

//p[preceding::*[1][contains(.,"Details")]]

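This selects every p element whose nearest preceding element contains the word "Details". In the sample, those are the paragraphs that directly follow a heading containing "Details".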

Output: 3 nodes
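As a quick check, the expression can be run through rvest against an inline copy of the sample. This is only a sketch under assumptions: the HTML string is an abridged copy of the question's example (the h4 and its two paragraphs are left out), and the names `html`, `doc`, and `paras` are made up for illustration.

```r
library(rvest)

# Abridged inline copy of the question's sample (assumption for illustration).
html <- '<div id="document">
<h3>Title of the document</h3>
<h5 id="id22222"> 1. Introduction </h5>
<p>Text here.</p>
<h6 id="33333"> 1.1 Preliminary introduction </h6>
<p> Text here. </p>
<h5 id="id44444"> 2. Details </h5>
<p>Text here.</p>
<h6 id="id55555"> 2.1 Details about A </h6>
<p> Text here. </p>
<h6 id="id66666"> 2.2 Details about B </h6>
<p> Text here. </p>
<h5 id="id77777"> 3. Timeline </h5>
<p>Text here.</p>
<h6 id="id88888"> 3.1 Timeline A </h6>
<p>Text here.</p>
</div>'

doc <- read_html(html)

# p elements whose nearest preceding element mentions "Details".
paras <- html_nodes(
  doc,
  xpath = '//p[preceding::*[1][contains(., "Details")]]'
)

length(paras)  # 3 for this sample
html_text(paras, trim = TRUE)
```

Note that the match depends on each paragraph sitting directly after a heading that contains "Details". If a section held two consecutive paragraphs, the second one would have a p as its nearest preceding element and would not be selected.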

If you need to keep the headings, you can use:

//*[preceding::*[1][contains(.,"Details")] or contains(text(),"Details")]

Output: 6 nodes
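The heading-preserving expression can be checked the same way. The sketch below also shows a different approach that is not part of the answer above: it bounds the selection by sibling position, keeping only the non-h5 siblings whose nearest preceding h5 is the "Details" heading. It does not rely on the sub-headings repeating the word "Details". The inline HTML is an abridged copy of the question's sample, and all variable names are assumptions for illustration.

```r
library(rvest)

# Abridged inline copy of the question's sample (assumption for illustration).
html <- '<div id="document">
<h3>Title of the document</h3>
<h5 id="id22222"> 1. Introduction </h5>
<p>Text here.</p>
<h6 id="33333"> 1.1 Preliminary introduction </h6>
<p> Text here. </p>
<h5 id="id44444"> 2. Details </h5>
<p>Text here.</p>
<h6 id="id55555"> 2.1 Details about A </h6>
<p> Text here. </p>
<h6 id="id66666"> 2.2 Details about B </h6>
<p> Text here. </p>
<h5 id="id77777"> 3. Timeline </h5>
<p>Text here.</p>
<h6 id="id88888"> 3.1 Timeline A </h6>
<p>Text here.</p>
</div>'

doc <- read_html(html)

# The answer's expression that keeps the headings as well.
with_headings <- html_nodes(doc, xpath = paste0(
  '//*[preceding::*[1][contains(., "Details")]',
  ' or contains(text(), "Details")]'
))
html_name(with_headings)  # 6 nodes for this sample

# Alternative sketch (not from the answer): siblings after the "Details" h5
# that are not h5 themselves and whose nearest preceding h5 sibling is the
# "Details" heading, so the selection stops at the next h5.
section <- html_nodes(doc, xpath = paste0(
  '//h5[contains(., "Details")]/following-sibling::*',
  '[not(self::h5)]',
  '[preceding-sibling::h5[1][contains(., "Details")]]'
))
html_name(section)
html_text(section, trim = TRUE)
```

For the sample, the alternative returns the three paragraphs and two h6 sub-headings of the "Details" section and nothing from "3. Timeline". It assumes that the h5 headings and the section content are siblings under the same parent, as they are in the example.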
