XML parsing in Python while replicating the schema


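I have the following XML file and want to import data from an Excel spreadsheet and place it inside certain elements (for example, eadid and titleproper). I tried the Python code below, but it produces an XML file that does not contain the complete schema. I also want to save each row of the Excel file to its own XML file.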

<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
    <eadid></eadid>
    <filedesc>
    <titlestmt>
        <titleproper></titleproper>
    </titlestmt>
    </filedesc>
    <profiledesc>
        <langusage>
            <language langcode="eng" scriptcode="Latn"></language>
        </langusage>
    </profiledesc>
    </eadheader>
        <archdesc level="fonds">
            <did>
                <unitid></unitid>
                <langmaterial>
                    <language langcode="eng" scriptcode="Latn"></language>
                </langmaterial>
                <unittitle></unittitle>
                <unitdate normal=""></unitdate>
                <physdesc>
                    <extent>1 Cubic Feet</extent>
                </physdesc>
                <langmaterial>
                    <language langcode="eng" scriptcode="Latn">English</language>
                </langmaterial>
            </did>
            <dsc>
            </dsc>
        </archdesc>
</ead>

Here is the code I tried in Python:

import pandas as pd
# pip install openpyxl  (run this in a shell, not inside the script)
from lxml import etree as et
import xml.etree.ElementTree as Xet
tree = et.parse('test_resource.xml')
root = tree.getroot()
raw_data = pd.read_excel(r'/Users/smeyerkukan/Desktop/ArchivesSpace/Python Coding/aspace.xlsx')
tree = et.parse('test_resource.xml')
root = tree.getroot()
for row in raw_data.iterrows():
    root_tags = et.SubElement(root, 'ExportData')
    Column_heading_1 = et.SubElement(root_tags, 'titleproper')
    Column_heading_2 = et.SubElement(root_tags, 'unittitle')
    Column_heading_3 = et.SubElement(root_tags, 'eadid')
    Column_heading_4 = et.SubElement(root_tags, 'unitid')
    Column_heading_7 = et.SubElement(root_tags, 'unitdate')
    Column_heading_1.text = str(row[1]['<titleproper>'])
    Column_heading_2.text = str(row[1]['<unittitle>'])
    Column_heading_3.text = str(row[1]['<eadid>'])
    Column_heading_4.text = str(row[1]['<unitid>'])
    Column_heading_7.text = str(row[1]['<unitdate>'])
tree = et.ElementTree(root)
et.indent(tree, space="\t", level=0)
tree.write('output.xml', encoding="utf-8")

This is the output I want:

<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
    <eadid>Church RG 095</eadid>
    <filedesc>
    <titlestmt>
        <titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
    </titlestmt>
    </filedesc>
    <profiledesc>
        <langusage>
            <language langcode="eng" scriptcode="Latn"></language>
        </langusage>
    </profiledesc>
    </eadheader>
        <archdesc level="fonds">
            <did>
                <unitid>Church RG 095</unitid>
                <langmaterial>
                    <language langcode="eng" scriptcode="Latn"></language>
                </langmaterial>
                <unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
                <unitdate normal="1811/2003"></unitdate>
                <physdesc>
                    <extent>1 Cubic Feet</extent>
                </physdesc>
                <langmaterial>
                    <language langcode="eng" scriptcode="Latn">English</language>
                </langmaterial>
            </did>
            <dsc>
            </dsc>
        </archdesc>
</ead>

This is the output I get:

<ExportData>
    <titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
    <unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
    <eadid>Church RG 095</eadid>
    <unitid>Church RG 095</unitid>
    <unitdate>1811/2003</unitdate>
</ExportData>
<ExportData>
    <titleproper>First Zion Lehigh Union Church (Alburtis, Pa.)</titleproper>
    <unittitle>First Zion Lehigh Union Church (Alburtis, Pa.)</unittitle>
    <eadid>Church RG 001</eadid>
    <unitid>Church RG 001</unitid>
    <unitdate>1843/2019</unitdate>
</ExportData>
<ExportData>
    <titleproper>Christ Reformed Church (Allentown, PA)</titleproper>
    <unittitle>Christ Reformed Church (Allentown, PA)</unittitle>
    <eadid>Church RG 002</eadid>
    <unitid>Church RG 002</unitid>
    <unitdate>1876/1982</unitdate>
</ExportData>

Here is a screenshot of the Excel file I am using. There are 916 rows in total.

1 Answer

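If you can save the Excel file as CSV, you don't need pandas. I suggest using a function that fills in the empty XML template and saves it to disk, using the row number as the file ID: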

import xml.etree.ElementTree as ET
import csv

def create_XML(row, no):
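    # Your template: ead.xml
    # Collect the namespaces declared in the template and register them so the
    # output keeps the original prefixes. This could be done once outside the
    # function, since the template does not change between calls.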
    ns = dict(node for event, node in ET.iterparse('ead.xml', events = ['start-ns']))
    for prefix, uri in ns.items():
        ET.register_namespace(prefix, uri)
        
    tree = ET.parse('ead.xml')
    root = tree.getroot()
    
    titleproper = root.find('.//titleproper', ns)
    titleproper.text = row[0]
  
    unittitle = root.find('.//unittitle', ns)
    unittitle.text = row[1]
    
    eadid = root.find('.//eadid', ns)
    eadid.text = row[2]
    
    unitid = root.find('.//unitid', ns)
    unitid.text = row[3]
    
    unitdate = root.find('.//unitdate', ns)
    unitdate.text = row[4]
    
    tree = ET.ElementTree(root)
    ET.indent(root, space='  ')
    tree.write(f"{no}_book.xml", xml_declaration=True)
    print(f"{no}_book.xml written!")
    

# Export the Excel sheet to CSV with ';' as the delimiter, e.g.:
# <titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>
# The row number (starting at 1) is used in the filename of each output file.
with open("Biblio.csv", newline='') as csvfile:
    book = csv.reader(csvfile, delimiter=';')
    next(book, None)  # skip the header row
    for no, line in enumerate(book, start=1):
        # Excel column0:<titleproper>; column1:<unittitle>; column2:<eadid>; column3:<unitid>; column6:<unitdate>
        row = [line[0], line[1], line[2], line[3], line[6]]
        create_XML(row, no)

Output:

<?xml version='1.0' encoding='us-ascii'?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
  <eadheader>
    <eadid>Church RG 095 </eadid>
    <filedesc>
      <titlestmt>
        <titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
      </titlestmt>
    </filedesc>
    <profiledesc>
      <langusage>
        <language langcode="eng" scriptcode="Latn" />
      </langusage>
    </profiledesc>
  </eadheader>
  <archdesc level="fonds">
    <did>
      <unitid>Church RG 095 </unitid>
      <langmaterial>
        <language langcode="eng" scriptcode="Latn" />
      </langmaterial>
      <unittitle>Salem Reformed Church (Philadelphia, PA) </unittitle>
      <unitdate normal="">1811/2003 </unitdate>
      <physdesc>
        <extent>1 Cubic Feet</extent>
      </physdesc>
      <langmaterial>
        <language langcode="eng" scriptcode="Latn">English</language>
      </langmaterial>
    </did>
    <dsc />
  </archdesc>
</ead>
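
The output above still differs from the requested one in three details: the declaration says us-ascii instead of UTF-8, the values keep the trailing spaces from the CSV, and the date ends up as element text rather than in the normal attribute of unitdate. Below is a minimal sketch of one way to address these, not a definitive implementation. The helper names fill_template and rows_to_documents, the shortened inline template, the "ead" lookup prefix, and the sample row (including the BeginDate and EndDate values) are assumptions made for illustration. It needs Python 3.9+ for ET.indent.

```python
import csv
import io
import xml.etree.ElementTree as ET

EAD_NS = "urn:isbn:1-931666-22-9"  # default namespace of the template
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
NS = {"ead": EAD_NS}  # "ead" is only a lookup prefix chosen for this sketch

# Register the namespaces once, so the serializer writes xmlns="..." and
# xsi:... instead of generated ns0:/ns1: prefixes.
ET.register_namespace("", EAD_NS)
ET.register_namespace("xsi", XSI_NS)


def fill_template(template_xml, record):
    """Return the filled template as UTF-8 bytes (hypothetical helper)."""
    root = ET.fromstring(template_xml)
    for tag in ("eadid", "titleproper", "unitid", "unittitle"):
        # strip() drops the trailing spaces that come from the CSV export
        root.find(f".//ead:{tag}", NS).text = record[tag].strip()
    # The requested output keeps the date in the "normal" attribute.
    root.find(".//ead:unitdate", NS).set("normal", record["unitdate"].strip())
    ET.indent(root, space="  ")
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def rows_to_documents(csv_file, template_xml):
    """Yield (filename, xml_bytes) for each data row of a ';'-delimited CSV."""
    reader = csv.reader(csv_file, delimiter=";")
    next(reader, None)  # skip the header row
    for no, line in enumerate(reader, start=1):
        record = {
            "titleproper": line[0],
            "unittitle": line[1],
            "eadid": line[2],
            "unitid": line[3],
            "unitdate": line[6],
        }
        yield f"{no}_book.xml", fill_template(template_xml, record)


# Shortened stand-in for ead.xml (assumption: only the elements being filled).
TEMPLATE = """<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader><eadid></eadid><filedesc><titlestmt><titleproper></titleproper></titlestmt></filedesc></eadheader>
<archdesc level="fonds"><did><unitid></unitid><unittitle></unittitle><unitdate normal=""></unitdate></did></archdesc>
</ead>"""

# Illustrative CSV content; the BeginDate/EndDate values are placeholders.
SAMPLE_CSV = (
    "<titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>\n"
    "Salem Reformed Church (Philadelphia, PA);"
    "Salem Reformed Church (Philadelphia, PA) ;"
    "Church RG 095 ;Church RG 095 ;1811;2003;1811/2003 \n"
)

documents = dict(rows_to_documents(io.StringIO(SAMPLE_CSV), TEMPLATE))
for name, data in documents.items():
    # To save to disk instead: pathlib.Path(name).write_bytes(data)
    print(name, len(data), "bytes")
```

With a real export, open("Biblio.csv", newline='') can be passed in place of the io.StringIO object, and the contents of ead.xml in place of TEMPLATE. Looking elements up with an explicit prefix avoids relying on the empty-prefix entry in the namespace dictionary, which ElementTree only supports from Python 3.8.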