How to better parse sibling content with HtmlAgilityPack

Question (votes: 0, answers: 1)

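I need to extract various pieces of information from this HTML.

In a perfect world I would have some helper attributes to work with, but for various reasons I am stuck with this structure and have to deal with the mess.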

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
</body>
</html>

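This is what I am doing:

 using System;
 using System.Collections.Generic;
 using HtmlAgilityPack;

 public static class Parser
 {
     public static void Parse(string html)
     {
         var document = new HtmlDocument();
         document.LoadHtml(html);
         var paragraphs = new List<string>();
         // SelectNodes returns null (not an empty collection) when nothing matches.
         var nodes = document.DocumentNode.SelectNodes("//p");
         if (nodes == null) return;
         foreach (var paragraphNode in nodes)
         {
             // Only the <p> text is collected; the <ul> that follows is never visited.
             paragraphs.Add(paragraphNode.InnerText.Trim() + Environment.NewLine);
         }
     }
 }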

paragraphNode.NextSibling does not contain the UL. What is the best way to parse this content?
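
A likely reason is that HtmlAgilityPack keeps the whitespace between `</p>` and `<ul>` as a text node, so `NextSibling` returns that text node rather than the list. The sketch below skips non-element siblings; `SiblingHelper` and `NextElementSibling` are hypothetical names, not part of the library.

```csharp
using HtmlAgilityPack;

public static class SiblingHelper
{
    // Hypothetical helper: returns the next sibling that is an element,
    // skipping whitespace text nodes and comments; null if there is none.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        var sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }
}
```

Calling `SiblingHelper.NextElementSibling(paragraphNode)` on the "Heading 1" paragraph should then reach the `<ul>`. The XPath `following-sibling::*[1]` passed to `SelectSingleNode` is another way to express the same step.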

I need to be careful, because the UL must form part of the preceding paragraph, so this is one content block:

<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>

This is the next content block:

<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>

I cannot change the structure of the HTML or rely on anything else. Is there a reasonably sane way to do this?

c# .net html-agility-pack
1 Answer
0 votes

You might want to try this extension: https://github.com/hcesar/HtmlAgilityPack.CssSelector

 document.QuerySelectorAll("p, li");
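
A selector such as `p, li` returns a flat list, so the link between each list and the paragraph before it would still need to be rebuilt. Below is a minimal sketch of one way to group the content with plain HtmlAgilityPack by walking the children of `<body>` in document order. It assumes that every `<p>` starts a new block and that any other element belongs to the current block; `BlockParser`, `ParseBlocks`, and `Clean` are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlockParser
{
    // Assumption: every <p> starts a new block, and any other element that
    // follows it (such as <ul>) belongs to that block until the next <p>.
    public static List<List<string>> ParseBlocks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var blocks = new List<List<string>>();
        // SelectSingleNode returns null when nothing matches, so fall back to the root.
        var container = document.DocumentNode.SelectSingleNode("//body") ?? document.DocumentNode;

        foreach (var node in container.ChildNodes)
        {
            // Skip whitespace text nodes and comments between elements.
            if (node.NodeType != HtmlNodeType.Element) continue;

            if (node.Name == "p" || blocks.Count == 0)
            {
                blocks.Add(new List<string>());
            }
            var current = blocks[blocks.Count - 1];

            if (node.Name == "ul" || node.Name == "ol")
            {
                // One entry per list item.
                current.AddRange(node.Elements("li").Select(li => Clean(li.InnerText)));
            }
            else
            {
                current.Add(Clean(node.InnerText));
            }
        }
        return blocks;
    }

    // Decodes HTML entities and trims surrounding whitespace.
    private static string Clean(string text) => HtmlEntity.DeEntitize(text).Trim();
}
```

For the HTML in the question this should give six blocks, with the "Heading 1" block holding the paragraph text followed by the three list items. Because `InnerText` runs the `<strong>` text and the text after `<br />` together, the heading can be read separately with `node.Element("strong")` if it needs to be kept apart.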