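I am trying to split a large XML document (88,645 lines) into multiple XML files based on a specific node. That node is <project>.
The large XML document is structured as follows: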
<?xml version="1.0" encoding="UTF-8"?>
<projects>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
My goal is to split the document into separate files like these:
XML 1:
<?xml version="1.0" encoding="UTF-8"?>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
XML 2:
<?xml version="1.0" encoding="UTF-8"?>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
</project>
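And so on. However, I don't want to hard-code the XML in the script; I want to feed it the actual (large) XML document instead.
Here is my initial code, which is based on hard-coded XML (but again, I want Python to read the actual XML document):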
import xml.etree.ElementTree as ET
xml = '''<projects>
<project>
<projectNumber>738951</projectNumber>
<projectType>CHANGE REQUEST</projectType>
<lineOfBusiness>COMMERCIAL</lineOfBusiness>
...
'''
root = ET.fromstring(xml)
counter = 1
for child in list(root):
    if child.tag.startswith('project'):
        src = ET.Element('project')
        src.append(child)
        with open(f'out_{counter}.xml', 'w') as f:
            tree = ET.ElementTree(src)
            tree.write(f, encoding="unicode")
        counter += 1
You can use XMLPullParser as a non-blocking tool and parse each project branch incrementally:
import xml.etree.ElementTree as ET
parser = ET.XMLPullParser(['start', 'end']) # other events are comment, pi, start-ns, end-ns
with open("Large.xml", 'r') as f_xml:
    for line in f_xml:
        parser.feed(line)
        for event, elem in parser.read_events():
            if event == "end" and elem.tag == "project":
                for tag_elem in elem.iter():
                    if tag_elem.tag == "projectNumber":
                        print(tag_elem.text)
                    if tag_elem.tag == "projectType":
                        print(tag_elem.text)
                    if tag_elem.tag == "lineOfBusiness":
                        print(tag_elem.text)
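The snippet above only prints a few fields. To actually write one file per project, the same pull-parser loop could be extended roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `split_projects`, the `out_dir` parameter, the 64 KiB chunk size, and the `out_N.xml` naming (borrowed from the question) are assumptions, and it assumes <project> elements are not nested inside one another.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def split_projects(source, out_dir=".", tag="project"):
    """Stream `source` and write every <tag> element to its own out_N.xml.

    `split_projects`, `out_dir` and the file naming are illustrative
    assumptions, not part of the original answer.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    parser = ET.XMLPullParser(["end"])  # only completed elements are needed
    written = []
    with open(source, "rb") as f_xml:
        # Feed the file in fixed-size chunks instead of loading it whole.
        for chunk in iter(lambda: f_xml.read(64 * 1024), b""):
            parser.feed(chunk)
            for _event, elem in parser.read_events():
                if elem.tag != tag:
                    continue
                elem.tail = None  # drop whitespace that followed </project>
                target = out_path / f"out_{len(written) + 1}.xml"
                ET.ElementTree(elem).write(
                    str(target), encoding="UTF-8", xml_declaration=True
                )
                written.append(target)
                elem.clear()  # release the subtree once it is on disk
    parser.close()
    return written
```

Calling `split_projects("Large.xml", "out")` would then return the list of paths it wrote. Opening the file in binary mode lets the parser honour the encoding named in the XML declaration.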
In XSLT 2.0 or later, this is:
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<xsl:template match="/">
<xsl:for-each select="/*/project">
<xsl:result-document href="proj{position()}.xml">
<xsl:copy-of select="."/>
</xsl:result-document>
</xsl:for-each>
</xsl:template>
</xsl:transform>
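For comparison, a rough Python counterpart of the stylesheet might look like the sketch below. It assumes the whole document fits in memory, which should be reasonable at around 88,000 lines. The function name `split_in_memory` and the `prefix` parameter are hypothetical; the `proj{position}.xml` naming mirrors the `href` used above.

```python
import xml.etree.ElementTree as ET


def split_in_memory(source, prefix="proj", tag="project"):
    """Parse `source` fully, then write each top-level <tag> to prefixN.xml.

    `split_in_memory` and `prefix` are illustrative names only.
    """
    root = ET.parse(source).getroot()
    names = []
    # findall() with a bare tag matches direct children only, like /*/project.
    for position, child in enumerate(root.findall(tag), start=1):
        child.tail = None  # avoid trailing whitespace after </project>
        name = f"{prefix}{position}.xml"
        ET.ElementTree(child).write(name, encoding="UTF-8", xml_declaration=True)
        names.append(name)
    return names
```

`split_in_memory("Large.xml")` would write `proj1.xml`, `proj2.xml`, and so on into the current directory.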