Using root.xpath() with a regex returns lxml.etree._ElementUnicodeResult

Question

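I am building a model that finds where a piece of text is located in an HTML file.

I have a database with a large amount of data from different newspaper articles, including the title, publication date, author and news text. By analyzing this data, I want to build a model that can find, on its own, the XPath of the HTML tags that contain this content.

The problem appears when I use a regular expression in the xpath method, like this: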

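from lxml import html

# The "re" prefix must be bound to the EXSLT regular-expressions namespace,
# otherwise lxml raises an "Undefined namespace prefix" error.
REGEXP_NS = {"re": "http://exslt.org/regular-expressions"}

with open('somecode.html', 'r') as f:
    root = html.fromstring(f.read())

list_of_xpaths = root.xpath('//*/@*[re:match(., "2019-04-15")]', namespaces=REGEXP_NS)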

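This is an example of searching the code for the publication date. It returns an lxml.etree._ElementUnicodeResult instead of an lxml.etree._Element.

Unfortunately, unlike an lxml.etree._Element, this type of object cannot be passed to root.getroottree().getpath(list_of_xpaths[0]) to get the XPath of where it is located.

Is there a way to get the XPath for this type of object? How?

Or is there a way to make lxml with a regex return an lxml.etree._Element instead?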

Answer 1

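The problem is that you are getting attribute values, which are represented as instances of the _ElementUnicodeResult class.

If we introspect what the _ElementUnicodeResult class provides, we can see that it lets you get to the element that carries the attribute via the .getparent() method: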

attribute = list_of_xpaths[0]
element = attribute.getparent()

print(root.getroottree().getpath(element))
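As for the second part of the question: if only the element is needed, one alternative to calling .getparent() is to move the attribute test into a predicate on the element, so that the XPath selects elements rather than attribute values. This is a sketch of a different query, not the one from the question; it uses re:test (the boolean EXSLT function) and a made-up inline HTML sample.

```python
from lxml import etree, html

# EXSLT regular-expressions namespace, bound to the "re" prefix
REGEXP_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample document, standing in for somecode.html
SAMPLE_HTML = """
<html>
  <head><meta name="date" content="2019-04-15"></head>
  <body>
    <div><time datetime="2019-04-15T08:00">15 April 2019</time></div>
  </body>
</html>
"""

root = html.fromstring(SAMPLE_HTML)

# Select every element that has at least one attribute matching the pattern.
elements = root.xpath('//*[@*[re:test(., "2019-04-15")]]', namespaces=REGEXP_NS)

tree = root.getroottree()
element_paths = [tree.getpath(el) for el in elements]
for path in element_paths:
    print(path)
```

The trade-off is that this form no longer tells you which attribute matched, so it fits best when the element path alone is enough.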

This gives us the path to the element, but since we also need the attribute name, we can do:

print(attribute.attrname)

Then, to get the complete XPath pointing to the element's attribute, we can use:

path_to_element = root.getroottree().getpath(element)
attribute_name = attribute.attrname

complete_path = path_to_element + "/@" + attribute_name
print(complete_path)
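Put together, the steps above can be wrapped in a small helper that runs end to end. This is a minimal sketch: the function name attribute_paths and the inline HTML sample are made up for illustration, and it uses re:test with an XPath variable instead of the literal re:match from the question.

```python
from lxml import html

# EXSLT regular-expressions namespace, bound to the "re" prefix
REGEXP_NS = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample document, standing in for somecode.html
SAMPLE_HTML = """
<html>
  <head><meta name="date" content="2019-04-15"></head>
  <body>
    <div><time datetime="2019-04-15T08:00">15 April 2019</time></div>
  </body>
</html>
"""


def attribute_paths(root, pattern):
    """Return the full XPath of every attribute whose value matches pattern."""
    tree = root.getroottree()
    # $pattern is an XPath variable, filled from the keyword argument below.
    hits = root.xpath(
        "//*/@*[re:test(., $pattern)]",
        namespaces=REGEXP_NS,
        pattern=pattern,
    )
    # Each hit is an _ElementUnicodeResult: getparent() is the owning element,
    # attrname is the name of the matched attribute.
    return [f"{tree.getpath(hit.getparent())}/@{hit.attrname}" for hit in hits]


root = html.fromstring(SAMPLE_HTML)
paths = attribute_paths(root, "2019-04-15")
for path in paths:
    print(path)
```

Passing the pattern as an XPath variable avoids having to escape quotes when the search string comes from a database.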

FYI, _ElementUnicodeResult also indicates whether it actually is an attribute via the .is_attribute property (since this class also represents text nodes and tails).
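To illustrate those flags, here is a small sketch that classifies XPath string results using .is_attribute, .is_text and .is_tail. The describe helper and the HTML snippet are invented for the example.

```python
from lxml import html

# Hypothetical snippet with one attribute, two text nodes and one tail
root = html.fromstring('<p class="lead">Published <b>2019-04-15</b> in print</p>')


def describe(result):
    """Say where an XPath string result came from."""
    parent = result.getparent()
    if result.is_attribute:
        return f"attribute {result.attrname!r} of <{parent.tag}>"
    if result.is_tail:
        # A tail is the text that follows the parent's closing tag.
        return f"tail of <{parent.tag}>"
    if result.is_text:
        return f"text of <{parent.tag}>"
    return "plain string"


results = root.xpath("//@*") + root.xpath("//text()")
descriptions = [describe(r) for r in results]
for value, description in zip(results, descriptions):
    print(repr(str(value)), "->", description)
```

Note that for a tail, getparent() returns the element the text follows, not the element that visually contains it, so a path built from it needs that in mind.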
