python lxml xpath query fails on a hardcoded url but works on a byte string

Question

I am trying to extract the XML attribute parsable-cite from text tags. I am parsing the XML from the URL "https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml".

The code I am using is below (also replicated at https://replit.com/join/ohhztxpqdr-aam88), written out here for convenience:

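from lxml import etree
import requests

# The bill XML referenced above
url = "https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml"

response = requests.get(url)
xml_response = response.content

tree = etree.fromstring(xml_response)
# Every <text> element whose text content contains 'is amended'
result = tree.xpath("//text[contains(., 'is amended')]")

for r in result:
  external_xref = r.find("external-xref")
  print(external_xref.attrib)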

I get an error message indicating that I am accessing None, as if the XPath search had found nothing:

AttributeError: 'NoneType' object has no attribute 'attrib'

When I use the same code directly on a snippet of the text node, I get the following result:

text = b'<text display-inline="no-display-inline">Section 4702 of the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act (<external-xref legal-doc="usc" parsable-cite="usc/18/249">18 U.S.C. 249</external-xref> note) is amended by adding at the end the following: </text>'

tree = etree.fromstring(text)
result = tree.xpath("//text[contains(., 'is amended')]")

for r in result:
  external_xref = r.find("external-xref")
  print(external_xref.attrib)

{'legal-doc': 'usc', 'parsable-cite': 'usc/18/249'}

The problem seems to come from working with the content fetched from the URL directly. Any suggestions on how to proceed?

Thanks

Answer 1

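In https://www.congress.gov/118/bills/hr61/BILLS-118hr61ih.xml there are two text elements that contain the string "is amended", but only one of them (the second) has an external-xref child element. For the first one, r.find("external-xref") returns None, which is what raises the AttributeError.

The following code update will avoid the error and produce the desired output: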

for r in result:
    external_xref = r.find("external-xref")
    if external_xref is not None:    # Check if there actually is an external-xref
        print(external_xref.attrib)
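
As an alternative to checking for None in Python, the filtering can be pushed into the XPath expression itself. The following is a minimal sketch, not the only way to do it. The sample bytes are a hypothetical, trimmed-down stand-in for the bill XML (so the example runs without a network request), and it assumes, as in the question, that the document uses no XML namespace.

```python
from lxml import etree

# Hypothetical stand-in for the bill XML: two <text> elements contain
# "is amended", but only the second has an <external-xref> child.
sample = b"""<bill>
  <text>This Act is amended as described below.</text>
  <text display-inline="no-display-inline">Section 4702 of the Act (<external-xref legal-doc="usc" parsable-cite="usc/18/249">18 U.S.C. 249</external-xref> note) is amended by adding at the end the following: </text>
</bill>"""

tree = etree.fromstring(sample)

# Step from each matching <text> to its external-xref children and read the
# parsable-cite attribute; <text> elements without such a child contribute
# nothing, so no None check is needed.
cites = tree.xpath("//text[contains(., 'is amended')]/external-xref/@parsable-cite")
print(cites)
```

Unlike r.find("external-xref"), which returns only the first matching child, this expression collects the attribute from every external-xref child of each matching text element.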
