Error from lxml's fromstring function in Python


I am trying to run the following:

import lxml.etree

xml_str = """
<root>
<H4>

</H4>
<P>
Hong Kong, February 06, 2020 -- </P>
<P>
&bull; Testing data only
</P>
</root>
"""

utf8_parser = lxml.etree.XMLParser(encoding='utf-8')
metadata_xml = lxml.etree.fromstring("""<root>""" + xml_str + """</root>""",
                                     parser=utf8_parser)

I get this error:

 File "src\lxml\etree.pyx", line 3236, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1757, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1068, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 9
lxml.etree.XMLSyntaxError: Entity 'bull' not defined, line 9, column 7

Does anyone know how I can fix this?

python xml lxml
1 Answer

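As jordanm commented, use the HTML parser instead of the XML parser. &bull; is an HTML named entity; XML predefines only &amp;, &lt;, &gt;, &quot; and &apos;, so an XML parser rejects &bull; unless a DTD declares it.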

import lxml.etree

xml_str = r"""
<root>
<H4>

</H4>
<P>
Hong Kong, February 06, 2020 -- </P>
<P>
&bull; Testing data only
</P>
</root>
"""

html_parser = lxml.etree.HTMLParser()

metadata_xml = lxml.etree.fromstring("""<root>""" + xml_str + """</root>""", 
                                     parser=html_parser)
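
One thing to be aware of with the HTML parser is that the result is no longer shaped like the input: the returned root element is html, and tag names such as P are lowercased. The following is a minimal sketch of reading the parsed text back out, assuming a shortened version of the input above; the variable names tree, paragraphs and text are illustrative only.

```python
import lxml.etree

# Shortened version of the input from the question (an assumption for brevity).
xml_str = """
<root>
<P>
&bull; Testing data only
</P>
</root>
"""

html_parser = lxml.etree.HTMLParser()
tree = lxml.etree.fromstring(xml_str, parser=html_parser)

# The HTML parser returns the enclosing <html> element, not <root>.
print(tree.tag)

# Tag names are lowercased by the HTML parser, so search for "p", not "P".
paragraphs = tree.findall(".//p")
text = paragraphs[0].text.strip()
print(text)
```

If the code that consumes metadata_xml expects the original tag case or expects root to be the top-level element, it would need to be adjusted accordingly.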

If you insist on using the XML parser, you can unescape the &bull; character entity reference first, like this:

import html

xml_str = html.unescape(xml_str)
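
Note that html.unescape also decodes &amp; and &lt;, so input that contains those references may no longer be well-formed XML afterwards. A more conservative approach, sketched below, rewrites only the HTML named entities that XML does not predefine into numeric character references, which any XML parser accepts. The helper name html_entities_to_numeric and the sample string are hypothetical, not part of lxml or of the original question.

```python
import re
from html.entities import name2codepoint

import lxml.etree

# The five entities every XML parser already understands; leave these alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def html_entities_to_numeric(text):
    """Replace HTML named entities (e.g. &bull;) with numeric references.

    Hypothetical helper: unknown names and XML's predefined entities
    are returned unchanged.
    """
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)
        return "&#{};".format(name2codepoint[name])

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


# Sample input (an assumption); &amp; must survive untouched.
sample = "<root><P>&bull; Testing &amp; data only</P></root>"
converted = html_entities_to_numeric(sample)

root = lxml.etree.fromstring(converted)
paragraph_text = root.findtext("P")
print(paragraph_text)
```

With this approach the document stays XML throughout, so the tag case and the root element are preserved.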