Split a large XML file into multiple files based on a specific node

Question (votes: 0, answers: 2)

I am trying to split a large XML document (88,645 lines) into multiple XML files based on a specific node, <project>. The large XML document is structured as follows:

<?xml version="1.0" encoding="UTF-8"?>
<projects>
  <project>
    <projectNumber>738951</projectNumber>
    <projectType>CHANGE REQUEST</projectType>
    <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
  </project>    

My goal is to split the document into files like these:

XML 1:

<?xml version="1.0" encoding="UTF-8"?>
<project>
   <projectNumber>738951</projectNumber>
   <projectType>CHANGE REQUEST</projectType>
   <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
</project>    

XML 2:

<?xml version="1.0" encoding="UTF-8"?>
<project>
   <projectNumber>738951</projectNumber>
   <projectType>CHANGE REQUEST</projectType>
   <lineOfBusiness>COMMERCIAL</lineOfBusiness>
     ...
</project>

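And so on. However, I don't want to paste the XML into the code as a string; I want to give Python the actual (large) XML document to read.

Below is my initial code, which works on XML embedded in the script (again, I want Python to read the actual XML document instead):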

import xml.etree.ElementTree as ET

xml = '''<projects>
  <project>
    <projectNumber>738951</projectNumber>
    <projectType>CHANGE REQUEST</projectType>
    <lineOfBusiness>COMMERCIAL</lineOfBusiness>
    ...
'''

root = ET.fromstring(xml)
counter = 1

for child in list(root):
    if child.tag == 'project':
        # Write the <project> element itself; appending it to a new
        # <project> element would nest one <project> inside another.
        tree = ET.ElementTree(child)
        tree.write(f'out_{counter}.xml', encoding='UTF-8', xml_declaration=True)
        counter += 1

Tags: python, xml, elementtree

2 Answers

0 votes

You can use XMLPullParser as a non-blocking tool and parse each project branch incrementally:

import xml.etree.ElementTree as ET

# Only "end" events are needed: a <project> is complete once its closing tag is seen.
# Other available events are start, comment, pi, start-ns and end-ns.
parser = ET.XMLPullParser(['end'])
wanted = {"projectNumber", "projectType", "lineOfBusiness"}

with open("Large.xml", 'rb') as f_xml:  # bytes, so the parser honours the declared encoding
    for line in f_xml:
        parser.feed(line)
        # Drain the events after every feed instead of after the whole file.
        for event, elem in parser.read_events():
            if elem.tag == "project":
                for tag_elem in elem.iter():
                    if tag_elem.tag in wanted:
                        print(tag_elem.text)
                elem.clear()  # release the finished project's children

parser.close()
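
The loop above only prints a few fields. To get one file per project, as the question asks, the same pull-parser approach can write each completed element to disk. The following is a minimal sketch, not a definitive implementation: the function name split_projects and the parameters prefix and chunk_size are hypothetical, the out_N.xml naming is taken from the question, and it assumes that no <project> element is nested inside another one.

```python
import xml.etree.ElementTree as ET


def split_projects(source_path, prefix="out_", chunk_size=65536):
    """Write every <project> element in source_path to its own file.

    Returns the list of file names written. `prefix` and `chunk_size`
    are illustrative parameters, not part of the original answer.
    """
    parser = ET.XMLPullParser(["end"])
    written = []
    with open(source_path, "rb") as src:
        # Feed the file in fixed-size chunks rather than loading it at once.
        while chunk := src.read(chunk_size):
            parser.feed(chunk)
            for _event, elem in parser.read_events():
                if elem.tag != "project":
                    continue
                name = f"{prefix}{len(written) + 1}.xml"
                elem.tail = None  # drop whitespace that followed </project>
                ET.ElementTree(elem).write(
                    name, encoding="UTF-8", xml_declaration=True
                )
                written.append(name)
                elem.clear()  # free the children that were just written
    parser.close()
    return written
```

Calling split_projects("Large.xml") would then produce out_1.xml, out_2.xml and so on in the current directory, each with <project> as its root element.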

0 votes

In XSLT 2.0 or later, this is:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="2.0">
<xsl:template match="/">
  <xsl:for-each select="/*/project">
    <xsl:result-document href="proj{position()}.xml">
       <xsl:copy-of select="."/>
    </xsl:result-document>
  </xsl:for-each>
</xsl:template>
</xsl:transform>
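
Note that xsl:result-document needs an XSLT 2.0 processor. Python's standard library has no XSLT engine, and lxml implements XSLT 1.0 only. For readers who want to stay in Python, here is a rough standard-library sketch of what the stylesheet does: it selects the project children of the document element (the /*/project path) and numbers the output files by position. The function name split_like_xslt and the pattern parameter are hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def split_like_xslt(source_path, pattern="proj{}.xml"):
    """Write each <project> child of the root to pattern.format(n).

    Returns the number of files written.
    """
    depth = 0
    position = 0
    for event, elem in ET.iterparse(source_path, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        # After the decrement, depth == 1 means elem is a direct child
        # of the document element, which is what /*/project selects.
        if depth == 1 and elem.tag == "project":
            position += 1
            elem.tail = None  # drop whitespace that followed </project>
            ET.ElementTree(elem).write(
                pattern.format(position), encoding="UTF-8", xml_declaration=True
            )
            elem.clear()  # free the children that were just written
    return position
```

Tracking the depth keeps the match limited to top-level projects, so an element named project deeper in the tree would be copied along with its parent rather than split out on its own.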