Using the lxml find() method to find elements in an XML file

Question

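My XML files are more than a million lines long. I can parse them with BeautifulSoup without any problem, but parsing with bs4 can take a minute or more. I am trying to parse them with lxml instead, hoping for a significant speedup, but I cannot get the find() method to work at all.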

Ultimately I want to replace this bs4 line with lxml code:

datamanagers = soup.find_all('Field', {'Name': 'DataManager'})

Since I could not get find() to work, I figured I would start small and fetch the first element under the root. A sample XML file looks like this:

<?xml version="1.0" encoding="utf-8"?>
<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="http://www.spotfire.com/schemas/Document1.0.xsd" SchemaVersion="1.0">
    <Object Id="1">
        <Type>
        ...
        </Type>
    </Object>
</Root>

So I tried:

from lxml import etree

with open(path_work + '\\' + file.stem + '\\' + 'AnalysisDocument.xml') as f:
    tree = etree.parse(f)
    root = tree.getroot()

tree.find('Object')
root.find('Object')
tree.find('.//Object')
root.find('.//Object')

Everything I try returns None. What am I doing wrong here? I have looked through a lot of answers, and they all make find() look very simple to use.

Answer 1

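The sample you showed declares a default namespace, http://www.spotfire.com/schemas/Document1.0.xsd. Unprefixed elements such as Object belong to that namespace, while a bare name passed to find() only matches elements that have no namespace, so that is probably why you cannot find the Object element.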
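A quick way to confirm this is to print the tags that the parser actually stores. The sketch below uses a trimmed copy of the sample from the question (the elided Type content is left out), and the try/except fallback to the standard library is only there so the snippet runs where lxml is not installed; both expose the same tag and find() behaviour for this case.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption for portability only: stdlib ElementTree behaves the same here.
    import xml.etree.ElementTree as etree

# Trimmed version of the sample document from the question.
xml = b"""<?xml version="1.0" encoding="utf-8"?>
<Root xmlns="http://www.spotfire.com/schemas/Document1.0.xsd" SchemaVersion="1.0">
    <Object Id="1">
        <Type/>
    </Object>
</Root>"""

root = etree.fromstring(xml)

# The default namespace is folded into every unprefixed element name.
print(root.tag)     # {http://www.spotfire.com/schemas/Document1.0.xsd}Root
print(root[0].tag)  # {http://www.spotfire.com/schemas/Document1.0.xsd}Object

# A bare name only matches elements in no namespace, so nothing is found.
print(root.find("Object"))  # None
```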

You have (at least) two options:

First, and this is my preference, you can bind that namespace URI to a prefix and use the prefix in the path expression:

ns = {"s": "http://www.spotfire.com/schemas/Document1.0.xsd"}

root.find("./s:Object", namespaces=ns)

The second is to use Clark notation, which spells out the URI and the element name directly in the path expression:

root.find("./{http://www.spotfire.com/schemas/Document1.0.xsd}Object")
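Both options can be tried side by side on a cut-down version of the sample document. This is a minimal sketch: the prefix "s" is arbitrary, the NS_URI constant is introduced here only for illustration, and the stdlib fallback is an assumption made so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption for portability only: stdlib ElementTree accepts the same calls.
    import xml.etree.ElementTree as etree

NS_URI = "http://www.spotfire.com/schemas/Document1.0.xsd"

# Cut-down version of the sample document from the question.
xml = f'<Root xmlns="{NS_URI}" SchemaVersion="1.0"><Object Id="1"><Type/></Object></Root>'
root = etree.fromstring(xml)

# Option 1: bind the URI to a prefix of your choosing and pass the mapping.
ns = {"s": NS_URI}
by_prefix = root.find("./s:Object", namespaces=ns)

# Option 2: Clark notation, {uri}localname, with no mapping needed.
by_clark = root.find(f"./{{{NS_URI}}}Object")

print(by_prefix.get("Id"))  # 1
print(by_clark.get("Id"))   # 1

# Every step of a longer path needs the prefix, not just the first one.
type_el = root.find("./s:Object/s:Type", namespaces=ns)
print(type_el is not None)  # True
```

The prefix form tends to stay readable once paths have several steps, which is one reason to prefer it.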

Also, you do not have to open the file like this:

with open(path_work + '\\' + file.stem + '\\' + 'AnalysisDocument.xml') as f:

You can simply pass the path to .parse() instead of a file object:

tree = etree.parse(path_work + '\\' + file.stem + '\\' + 'AnalysisDocument.xml')
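Coming back to the bs4 line in the question, a namespace-aware findall() with an attribute predicate is one possible replacement for find_all('Field', {'Name': 'DataManager'}). The question does not show any Field elements, so the document layout below (including the Fields wrapper) is hypothetical and assumes they sit in the same default namespace; the temporary file only stands in for AnalysisDocument.xml, and the stdlib fallback is there so the sketch runs without lxml.

```python
import tempfile
from pathlib import Path

try:
    from lxml import etree
except ImportError:
    # Assumption for portability only: stdlib ElementTree accepts the same calls.
    import xml.etree.ElementTree as etree

NS = {"s": "http://www.spotfire.com/schemas/Document1.0.xsd"}

# Hypothetical layout: the real file's Field elements are not shown in the question.
sample = """<Root xmlns="http://www.spotfire.com/schemas/Document1.0.xsd">
  <Object Id="1">
    <Fields>
      <Field Name="DataManager"/>
      <Field Name="Other"/>
    </Fields>
  </Object>
  <Object Id="2">
    <Fields>
      <Field Name="DataManager"/>
    </Fields>
  </Object>
</Root>"""

with tempfile.TemporaryDirectory() as tmp:
    xml_path = Path(tmp) / "AnalysisDocument.xml"
    xml_path.write_text(sample, encoding="utf-8")
    # parse() takes a filename directly, so no open() call is needed.
    tree = etree.parse(str(xml_path))

root = tree.getroot()

# .// searches all descendants, like find_all(); the predicate filters on Name.
datamanagers = root.findall(".//s:Field[@Name='DataManager']", namespaces=NS)
print(len(datamanagers))  # 2
```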
