Defining a default (unprefixed) namespace in lxml

Question

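When rendering XHTML with lxml, everything is fine unless you happen to use Firefox, which seems unable to deal with namespace-prefixed XHTML elements and JavaScript. Opera executes the JavaScript fine (this applies to both jQuery and MathJax) whether or not the XHTML namespace has a prefix (h: in my case), but in Firefox the scripts abort with weird errors (this.head is undefined, in the case of MathJax).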

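I know about the register_namespace function, but it accepts neither None nor "" as a namespace prefix. I have heard about _namespace_map in the lxml.etree module, but my Python complains that this attribute does not exist (version issues?).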

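Is there any other way to remove the namespace prefix for the XHTML namespace? Note that str.replace, as suggested in the answer to another, related question, is not a method I can accept, because it is not aware of XML semantics and could easily corrupt the resulting document.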
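
To make the objection to str.replace concrete, here is a minimal sketch that uses only the standard library. The sample markup is made up for illustration and is not one of the examples referred to in the question. It shows two ways a blind textual replacement can go wrong: it rewrites matching text content, and it leaves the element outside the XHTML namespace, because the xmlns:h declaration is never turned into a default one.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Hypothetical sample: the text content happens to mention the prefix.
source = '<h:p xmlns:h="%s">Use the h:p prefix</h:p>' % XHTML

# Naive approach: strip every occurrence of "h:" from the string.
naive = source.replace("h:", "")
print(naive)

before = ET.fromstring(source)
after = ET.fromstring(naive)

# The element silently left the XHTML namespace ...
print(before.tag)
print(after.tag)

# ... and the text content was modified as well.
print(before.text)
print(after.text)
```

The string still parses, which is what makes this failure mode easy to miss.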

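As requested, here are two ready-to-use examples: one with namespace prefixes and one without. The first displays 0 in Firefox (wrong) and the second displays 1 (correct). Opera renders both correctly. This is obviously a Firefox bug, but it only serves as the rationale for wanting prefixless XHTML from lxml. There are other good reasons, such as reducing traffic for mobile clients (even h: adds up if you consider tens or hundreds of HTML tags).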

Answer 1

This XSL transformation removes all prefixes from content, while maintaining the namespaces defined in the root node:

import lxml.etree as ET

content = '''\
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<h:html xmlns:h="http://www.w3.org/1999/xhtml" xmlns:ml="http://foo">
  <h:head>
    <h:title>MathJax Test Page</h:title>
    <h:script type="text/javascript"><![CDATA[
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    ]]></h:script>
  </h:head>
  <h:body onload="test();">
    <h:p>test</h:p>
    <ml:foo></ml:foo>
  </h:body>
</h:html>
'''
dom = ET.fromstring(content)

xslt = '''\
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<!-- identity transform for everything else -->
<xsl:template match="/|comment()|processing-instruction()|*|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
</xsl:template>

<!-- remove NS from XHTML elements -->
<xsl:template match="*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()" />
    </xsl:element>
</xsl:template>

<!-- remove NS from XHTML attributes -->
<xsl:template match="@*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="." />
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

xslt_doc = ET.fromstring(xslt)
transform = ET.XSLT(xslt_doc)
dom = transform(dom)

# tostring() returns bytes; decode so Python 3 prints text rather than b'...'
print(ET.tostring(dom, pretty_print=True,
                  encoding='utf-8').decode('utf-8'))

yields

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>MathJax Test Page</title>
    <script type="text/javascript">
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    </script>
  </head>
  <body onload="test();">
    <p>test</p>
    <ml:foo xmlns:ml="http://foo"/>
  </body>
</html>
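
As far as I understand XSLT 1.0, the prefixes disappear because xsl:element resolves an unprefixed name against the default namespace in scope in the stylesheet, which here is the XHTML namespace. The following reduced sketch isolates that mechanism and checks the result programmatically instead of by eye. It is a simplification, not the stylesheet above: it rebuilds every element regardless of its namespace, and the sample input is made up.

```python
import lxml.etree as ET

XHTML = "http://www.w3.org/1999/xhtml"

# Reduced stylesheet: every element is rebuilt under its local name only,
# so it picks up the stylesheet's default namespace (XHTML).
stylesheet = ET.XML('''\
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="*">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:copy/>
  </xsl:template>
</xsl:stylesheet>
''')
transform = ET.XSLT(stylesheet)

# Hypothetical input with a prefixed XHTML element.
source = ET.XML('<h:p xmlns:h="%s" class="note">test</h:p>' % XHTML)
result = transform(source).getroot()

print(result.tag)     # still namespace-qualified (Clark notation)
print(result.prefix)  # None means no prefix is used
serialized = ET.tostring(result).decode("utf-8")
print(serialized)
```

The element keeps its namespace URI and only loses the prefix, which is the property a plain string replacement cannot guarantee.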

Answer 2

Use ElementMaker and give it an nsmap that maps None to your default namespace.

#!/usr/bin/env python
# dogeml.py

from lxml.builder import ElementMaker
from lxml import etree

E = ElementMaker(
    nsmap={
        None: "http://wow/"    # <--- This is the special sauce
    }
)

doge = E.doge(
    E.such('markup'),
    E.many('very namespaced', syntax="tricks")
)

options = {
    'pretty_print': True,
    'xml_declaration': True,
    'encoding': 'UTF-8',
}

serialized_bytes = etree.tostring(doge, **options)
print(serialized_bytes.decode(options['encoding']))

As you can see in the output of this script, the default namespace is defined, but the tags have no prefix.

<?xml version='1.0' encoding='UTF-8'?>
<doge xmlns="http://wow/">
  <such>markup</such>
  <many syntax="tricks">very namespaced</many>
</doge>

I have tested this code with Python 2.7.6, 3.3.5, and 3.4.0, together with lxml 3.3.1.
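
One caveat that may be worth checking: as far as I can tell, the nsmap passed to ElementMaker only controls which declarations get written out. The elements built above have plain tags such as doge in memory, and they only end up in http://wow/ once the serialized text is parsed again. Passing the namespace argument of ElementMaker as well should keep the in-memory tree and the serialized document in agreement. A small sketch, reusing the made-up http://wow/ URI from this answer:

```python
from lxml.builder import ElementMaker
from lxml import etree

NS = "http://wow/"

# nsmap declares the default namespace; namespace puts new elements into it.
E = ElementMaker(namespace=NS, nsmap={None: NS})

doge = E.doge(
    E.such('markup'),
    E.many('very namespaced', syntax="tricks")
)

print(doge.tag)     # fully qualified tag in Clark notation
print(doge.prefix)  # None: the default namespace is used

serialized = etree.tostring(doge).decode('utf-8')
print(serialized)

# Parsing the output again yields the same qualified tag as in memory.
reparsed = etree.fromstring(serialized)
print(reparsed.tag == doge.tag)
```

This matters mostly when the tree is queried (for example with find or XPath) before it is serialized, since lookups match on the qualified tag.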

Answer 3

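To expand on @neirbowj's answer, but using ET.Element and ET.SubElement, and rendering a document with a mix of namespaces, where the root happens to be explicitly namespaced and a subelement (channel) is in the default namespace: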

import lxml.etree as ET

# I set up but don't use the default namespace:
root = ET.Element('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF', nsmap={None: 'http://purl.org/rss/1.0/'})
# I use the default namespace by including its URL in curly braces:
e = ET.SubElement(root, '{http://purl.org/rss/1.0/}channel')
print(ET.tostring(root, xml_declaration=True, encoding='utf8').decode())

This prints the following:

<?xml version='1.0' encoding='utf8'?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><channel/></rdf:RDF>

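It automatically uses rdf for the RDF namespace. I'm not sure how it figures that out. If I want to specify the prefix, I can add it to the nsmap of the root element: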

nsmap = {None: 'http://purl.org/rss/1.0/',
         'doge': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
root = ET.Element('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF', nsmap=nsmap)
e = ET.SubElement(root, '{http://purl.org/rss/1.0/}channel')
print(ET.tostring(root, xml_declaration=True, encoding='utf8').decode())

...and I get:

<?xml version='1.0' encoding='utf8'?>
<doge:RDF xmlns:doge="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"><channel/></doge:RDF>
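
On where rdf comes from: my understanding is that lxml keeps a small built-in table of conventional prefixes for a few well-known namespace URIs and falls back to generated prefixes for everything else. I have not verified the full list, so treat that as an assumption. The sketch below inspects the prefix and nsmap attributes to see what lxml picked, without serializing. The http://example.org/unknown URI is hypothetical and only there for comparison.

```python
import lxml.etree as ET

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
RSS = 'http://purl.org/rss/1.0/'
UNKNOWN = 'http://example.org/unknown'  # hypothetical URI for comparison

# Same construction as above: explicit RDF root, default-namespace child.
root = ET.Element('{%s}RDF' % RDF, nsmap={None: RSS})
channel = ET.SubElement(root, '{%s}channel' % RSS)

# An element in a namespace lxml has no conventional prefix for.
other = ET.Element('{%s}item' % UNKNOWN)

print(root.prefix)     # prefix chosen for the RDF namespace
print(channel.prefix)  # None means the default namespace is used
print(other.prefix)    # a generated prefix
print(root.nsmap)      # all prefix-to-URI mappings in scope at the root
```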
