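I need to extract various pieces of information from this HTML.
In a perfect world I would have some helper attributes to work with, but for various reasons I'm stuck with this structure and have to deal with the mess.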
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p>Paragraph text.</p>
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
</body>
</html>
This is what I'm doing:
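using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static void Parse(string html)
{
    var document = new HtmlDocument();
    document.LoadHtml(html);
    var paragraphs = new List<string>();
    var heading = string.Empty;
    // SelectNodes returns null when nothing matches, so guard before iterating.
    var nodes = document.DocumentNode.SelectNodes("//p");
    if (nodes == null) return;
    for (int i = 0; i < nodes.Count; i++)
    {
        var paragraphNode = nodes[i];
        // InnerText covers only the <p> itself, not the <ul> that follows it.
        paragraphs.Add(paragraphNode.InnerText.Trim() + Environment.NewLine);
    }
}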
paragraphNode.NextSibling does not contain the UL. What is the best way to parse this content?
I need to be careful, because the UL must form part of the preceding paragraph, so this is one content block:
<p><strong>Heading 1 text I can extract</strong><br />Paragraph text - this is where the extraction ends for this paragraph/strong, I need to include the list (and any other content before the next paragraph/strong)</p>
<ul>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
<li>I need to pull out these list items;</li>
</ul>
This is the next content block:
<p><strong>Heading 2 text I can extract</strong><br />Paragraph text - this extracts fine</p>
I can't change the structure of the HTML or rely on anything else. Is there a reasonably sane way to do this?
You might want to try this extension: https://github.com/hcesar/HtmlAgilityPack.CssSelector
document.QuerySelectorAll("p, li");
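A flat p, li selection returns the nodes but loses the grouping, so if each list has to stay with the heading paragraph before it, another option is to walk the children of <body> in order with plain HtmlAgilityPack and start a new block at every <p> that has a direct <strong> child. This is only a sketch: the ContentBlock and BlockParser names are made up for illustration, and it assumes the headings always appear as <p><strong>...</strong>...</p> directly under <body>. It skips non-element nodes because NextSibling can return the whitespace text node sitting between </p> and <ul>, which is probably why the UL did not show up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for one heading and everything that follows it.
public sealed class ContentBlock
{
    public string Heading { get; set; } = string.Empty;
    public List<string> Lines { get; } = new List<string>();
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class BlockParser
{
    public static List<ContentBlock> Parse(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var blocks = new List<ContentBlock>();
        ContentBlock current = null;

        // Fall back to the document root if the markup has no <body>.
        var body = document.DocumentNode.SelectSingleNode("//body") ?? document.DocumentNode;

        foreach (var node in body.ChildNodes)
        {
            // Skip whitespace text nodes and comments between elements.
            if (node.NodeType != HtmlNodeType.Element) continue;

            // Assumption: a <p> with a direct <strong> child starts a new block.
            var strong = node.Name == "p" ? node.Element("strong") : null;
            if (strong != null)
            {
                current = new ContentBlock { Heading = Clean(strong.InnerText) };
                blocks.Add(current);

                // Text in the same <p> outside the <strong>, e.g. after the <br />.
                var rest = Clean(string.Concat(
                    node.ChildNodes.Where(n => n != strong).Select(n => n.InnerText)));
                if (rest.Length > 0) current.Lines.Add(rest);
            }
            else if (current != null)
            {
                if (node.Name == "ul" || node.Name == "ol")
                {
                    // One line per list item.
                    current.Lines.AddRange(node.Elements("li").Select(li => Clean(li.InnerText)));
                }
                else
                {
                    current.Lines.Add(Clean(node.InnerText));
                }
            }
            // Elements before the first heading are ignored in this sketch.
        }

        return blocks;
    }

    private static string Clean(string text) => HtmlEntity.DeEntitize(text).Trim();
}
```

With the sample HTML from the question this should yield two blocks: the first holds the paragraph text plus the three list items, the second only its paragraph text. The four plain paragraphs at the top are dropped because no heading precedes them; adjust that branch if they matter.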