How to better parse sibling content with HtmlAgilityPack

Question

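I need to extract various pieces of information from this HTML.

In a perfect world I would have some helper attributes to work with, but for various reasons I am stuck with this structure and have to deal with the mess.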

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
</body>
</html>

This is what I am doing:

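 using System;
 using System.Collections.Generic;
 using HtmlAgilityPack;

 public static void Parse(string html)
 {
     var document = new HtmlDocument();
     document.LoadHtml(html);
     var paragraphs = new List<string>();
     // SelectNodes returns null (not an empty collection) when nothing matches.
     var nodes = document.DocumentNode.SelectNodes("//p");
     if (nodes == null) return;
     foreach (var paragraphNode in nodes)
     {
         // InnerText covers only the <p> itself, so the <ul> that follows is not included.
         paragraphs.Add(paragraphNode.InnerText.Trim() + Environment.NewLine);
     }
 }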

paragraphNode.NextSibling does not contain the UL. What is the best way to parse this content?

I need to be careful, because the UL must form part of the preceding paragraph, so this is one content block:

<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>

And this is the next content block:

<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>

I cannot change the structure of the HTML or rely on anything else. Is there a reasonably sane way to do this?

Answer 1

You might want to try this extension: https://github.com/hcesar/HtmlAgilityPack.CssSelector

 document.QuerySelectorAll("p, li");
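A flat selection like this returns the paragraphs and list items in one sequence, so it does not by itself say which list belongs to which heading. As an alternative using plain HtmlAgilityPack, here is a minimal sketch that walks the children of the body in document order and starts a new block whenever a p contains a strong child. NextSibling most likely returns the whitespace text node between the p and the ul rather than the ul itself, which is why the sketch skips non-element nodes. The names BlockParser and ParseBlocks are hypothetical, and the sketch assumes that all blocks are direct children of the body and that content before the first heading can be ignored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class BlockParser
{
    // Hypothetical helper: groups each <p> that has a <strong> child with the
    // sibling elements that follow it, up to the next such <p>.
    public static List<(string Heading, List<string> Lines)> ParseBlocks(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var blocks = new List<(string Heading, List<string> Lines)>();
        var body = document.DocumentNode.SelectSingleNode("//body");
        if (body == null) return blocks;

        List<string> current = null;
        foreach (var node in body.ChildNodes)
        {
            // Skip whitespace text nodes and comments between elements.
            if (node.NodeType != HtmlNodeType.Element) continue;

            var strong = node.Name == "p" ? node.Element("strong") : null;
            if (strong != null)
            {
                // A heading paragraph opens a new block.
                current = new List<string>();
                blocks.Add((HtmlEntity.DeEntitize(strong.InnerText).Trim(), current));

                // The rest of the paragraph (everything except the <strong>) is the first line.
                var rest = string.Concat(node.ChildNodes
                    .Where(child => child != strong)
                    .Select(child => child.InnerText));
                rest = HtmlEntity.DeEntitize(rest).Trim();
                if (rest.Length > 0) current.Add(rest);
            }
            else if (current != null)
            {
                // Any other element belongs to the block opened by the last heading.
                var items = node.Descendants("li").ToList();
                if (items.Count > 0)
                {
                    // Lists contribute one line per <li>.
                    current.AddRange(items.Select(li => HtmlEntity.DeEntitize(li.InnerText).Trim()));
                }
                else
                {
                    var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
                    if (text.Length > 0) current.Add(text);
                }
            }
            // Elements before the first heading are ignored (assumption of this sketch).
        }
        return blocks;
    }
}
```

Called on the HTML from the question, this should give one block per heading, with the list items attached to the first block. If the blocks can be nested deeper than the body, the loop would need to start from whichever node is their common parent.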
