Finding all elements by partially matching the tag with XPath in Python ElementTree

Question

I am trying to find all heading elements in an XHTML ElementTree, and I would like to know whether this can be done with XPath.

<body>
  <h1>title</h1>
  <h2>heading 1</h2>
  <p>text</p>
  <h3>heading 2</h3>
  <p>text</p>
  <h2>heading 3</h2>
  <p>text</p>
</body>

My goal is to get all heading elements in document order, and the naive solution does not work:

for element in tree.iterfind("h*"):
  foo(element)

Because they have to come out in document order, I cannot iterate over each heading tag separately:

headings = {f"h{n}" for n in range(1, 6+1)}

for heading in headings:
  for element in tree.iterfind(heading):
    foo(element)

(but for element in filter(lambda el: el.tag in headings, tree.iter()) works)

And I cannot use a plain regex, because it fails on comments (their tag is not a string):

import re
pattern = re.compile("^h[1-6]$")
is_heading = lambda el: pattern.match(el.tag)

for element in filter(is_heading, tree.iter()):
  foo(element)

(but is_heading = lambda el: isinstance(el.tag, str) and pattern.match(el.tag) works)

None of these solutions is particularly elegant, so I am wondering whether there is a better way to find all heading elements, in order, using XPath.

Answer 1

Like this:

//*[self::h1 or self::h2 or self::h3]
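The expression above can be tried out with lxml; the standard library's xml.etree.ElementTree implements only a limited XPath subset and has no self:: axis. The following is a minimal sketch, assuming lxml is installed and that the tags are not namespaced. The variable names are illustrative, and the predicate is extended to h4 through h6.

```python
from lxml import etree

xml = """
<body>
  <h1>title</h1>
  <h2>heading 1</h2>
  <p>text</p>
  <h3>heading 2</h3>
  <p>text</p>
  <h2>heading 3</h2>
  <p>text</p>
</body>
"""

root = etree.fromstring(xml)

# Build "self::h1 or self::h2 or ... or self::h6" instead of typing it out.
predicate = " or ".join(f"self::h{n}" for n in range(1, 7))
expr = f"//*[{predicate}]"

# xpath() returns the matching elements in document order.
headings = [el.text for el in root.xpath(expr)]
print(headings)
```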

Answer 2

If you can use lxml, you can use the union operator |...

from lxml import etree

xml = """
<body>
  <h1>title</h1>
  <h2>heading 1</h2>
  <p>text</p>
  <h3>heading 2</h3>
  <p>text</p>
  <h2>heading 3</h2>
  <p>text</p>
</body>
"""

tree = etree.fromstring(xml)

for elm in tree.xpath("//h1|//h2|//h3"):
    print(elm.text)

This prints...

title
heading 1
heading 2
heading 3

lxml also lets you use the self:: axis mentioned in the other answer, if you prefer.
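If lxml is not an option, note that the standard library's xml.etree.ElementTree supports neither the | operator nor the self:: axis in its path syntax. A rough standard-library equivalent, sketched here under the assumption that filtering during a full tree walk is acceptable, relies on iter() visiting elements in document order. The names HEADING_TAGS and headings are illustrative.

```python
import xml.etree.ElementTree as ET

xml = """
<body>
  <h1>title</h1>
  <h2>heading 1</h2>
  <p>text</p>
  <h3>heading 2</h3>
  <p>text</p>
  <h2>heading 3</h2>
  <p>text</p>
</body>
"""

HEADING_TAGS = {f"h{n}" for n in range(1, 7)}

root = ET.fromstring(xml)

# iter() walks the tree in document order, so filtering preserves that order.
# The isinstance check guards against non-string tags (such as lxml comments)
# if the same code is later run on an lxml tree.
headings = [
    el.text
    for el in root.iter()
    if isinstance(el.tag, str) and el.tag in HEADING_TAGS
]
print(headings)
```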

Answer 3

Another approach:

from simplified_scrapy import SimplifiedDoc,req,utils
html ='''
<body>
  <h1>title</h1>
  <h2>heading 1</h2>
  <p>text</p>
  <h3>heading 2</h3>
  <p>text</p>
  <h2>heading 3</h2>
  <p>text</p>
</body>'''
doc = SimplifiedDoc(html)
hs = doc.getElementsByReg('h[1-9]')
print(hs.text)

Result:

['title', 'heading 1', 'heading 2', 'heading 3']

Answer 4

This XPath should also work:

'//*[starts-with(name(), "h") and not(translate(substring(name(),string-length(name())), "0123456789", ""))]'
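One thing to keep in mind is that this expression only checks that the name starts with "h" and ends with a digit, so it would also match names such as h7. The sketch below, which assumes lxml is installed, runs it against a small document containing a hypothetical header2 element (not part of the question) and compares it with a stricter variant written for this illustration.

```python
from lxml import etree

# <header2> is a made-up element, included only to show the loose match.
xml = (
    "<body><h1>title</h1><p>text</p><h3>heading 2</h3>"
    "<hr/><header2>not a heading</header2></body>"
)
root = etree.fromstring(xml)

# The expression from the answer: first character "h", last character a digit.
loose = (
    '//*[starts-with(name(), "h") and '
    'not(translate(substring(name(), string-length(name())), "0123456789", ""))]'
)

# Stricter variant (an assumption, not from the answer): exactly "h" + one of 1-6.
strict = (
    '//*[string-length(name()) = 2 and starts-with(name(), "h") '
    'and contains("123456", substring(name(), 2))]'
)

loose_tags = [el.tag for el in root.xpath(loose)]
strict_tags = [el.tag for el in root.xpath(strict)]
print(loose_tags)
print(strict_tags)
```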
