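I have the following XML file and want to import data from an Excel spreadsheet and place it between certain elements (for example eadid and titleproper). I tried the Python code below, but it produced an XML file that does not contain the full schema. I want to save each row of the Excel file to a separate XML file.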
<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid></eadid>
<filedesc>
<titlestmt>
<titleproper></titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn"></language>
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid></unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn"></language>
</langmaterial>
<unittitle></unittitle>
<unitdate normal=""></unitdate>
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc>
</dsc>
</archdesc>
</ead>
Here is the code I tried in Python:
import pandas as pd
pip install openpyxl
from lxml import etree as et
import xml.etree.ElementTree as Xet
tree = et.parse('test_resource.xml')
root = tree.getroot()
raw_data = pd.read_excel(r'/Users/smeyerkukan/Desktop/ArchivesSpace/Python Coding/aspace.xlsx')
tree = et.parse('test_resource.xml')
root = tree.getroot()
for row in raw_data.iterrows():
    root_tags = et.SubElement(root, 'ExportData')
    Column_heading_1 = et.SubElement(root_tags, 'titleproper')
    Column_heading_2 = et.SubElement(root_tags, 'unittitle')
    Column_heading_3 = et.SubElement(root_tags, 'eadid')
    Column_heading_4 = et.SubElement(root_tags, 'unitid')
    Column_heading_7 = et.SubElement(root_tags, 'unitdate')
    Column_heading_1.text = str(row[1]['<titleproper>'])
    Column_heading_2.text = str(row[1]['<unittitle>'])
    Column_heading_3.text = str(row[1]['<eadid>'])
    Column_heading_4.text = str(row[1]['<unitid>'])
    Column_heading_7.text = str(row[1]['<unitdate>'])
tree = et.ElementTree(root)
et.indent(tree, space="\t", level=0)
tree.write('output.xml', encoding="utf-8")
This is the output I want:
<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid>Church RG 095</eadid>
<filedesc>
<titlestmt>
<titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn"></language>
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid>Church RG 095</unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn"></language>
</langmaterial>
<unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
<unitdate normal="1811/2003"></unitdate>
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc>
</dsc>
</archdesc>
</ead>
This is the output I get:
<ExportData>
<titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
<unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
<eadid>Church RG 095</eadid>
<unitid>Church RG 095</unitid>
<unitdate>1811/2003</unitdate>
</ExportData>
<ExportData>
<titleproper>First Zion Lehigh Union Church (Alburtis, Pa.)</titleproper>
<unittitle>First Zion Lehigh Union Church (Alburtis, Pa.)</unittitle>
<eadid>Church RG 001</eadid>
<unitid>Church RG 001</unitid>
<unitdate>1843/2019</unitdate>
</ExportData>
<ExportData>
<titleproper>Christ Reformed Church (Allentown, PA)</titleproper>
<unittitle>Christ Reformed Church (Allentown, PA)</unittitle>
<eadid>Church RG 002</eadid>
<unitid>Church RG 002</unitid>
<unitdate>1876/1982</unitdate>
</ExportData>
Here is a screenshot of the Excel file I am using. It has 916 rows in total:
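If you can save the Excel file as a CSV file, you don't need pandas. I suggest using a function that fills in the empty XML template and saves it to disk, with the row number as the file id. (Note/question: does every XML file really always have the same urn:isbn:1-931666-22-9?)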
import csv
import xml.etree.ElementTree as ET

def create_XML(row, no):
    # Your template ead.xml is filled with the data of one row and saved as its own file.
    # Register namespaces so the output keeps xmlns="..." instead of ns0: prefixes.
    # ns = dict(node for event, node in ET.iterparse('ead.xml', events = ['start-ns']))
    ns = {'': 'urn:isbn:1-931666-22-9', 'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
    for prefix, uri in ns.items():
        ET.register_namespace(prefix, uri)
    # Re-parse the template for every row so each file starts from the empty elements.
    tree = ET.parse('ead.xml')
    root = tree.getroot()
    # The '' key maps the default namespace in find() (Python 3.8+).
    titleproper = root.find('.//titleproper', ns)
    titleproper.text = row[0]
    unittitle = root.find('.//unittitle', ns)
    unittitle.text = row[1]
    eadid = root.find('.//eadid', ns)
    eadid.text = row[2]
    unitid = root.find('.//unitid', ns)
    unitid.text = row[3]
    # The date range goes into the 'normal' attribute, not the element text.
    unitdate = root.find('.//unitdate', ns)
    unitdate.attrib['normal'] = row[4]
    ET.indent(root, space=' ')  # ET.indent needs Python 3.9+
    tree.write(f"{no}_book.xml", xml_declaration=True)
    print(f"{no}_book.xml written!")

# Export the Excel sheet to CSV with ';' as the delimiter, e.g.:
# <titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>
# no is the row number, used in the file name of each generated XML file
no = 1
with open("Biblio.csv", newline='') as csvfile:
    book = csv.reader(csvfile, delimiter=';')
    next(book, None)  # skip the header row
    for line in book:
        # Excel column0:<titleproper>; column1:<unittitle>; column2:<eadid>; column3:<unitid>; column6:<unitdate>
        # strip() removes the padding that some exported fields carry
        row = [line[i].strip() for i in (0, 1, 2, 3, 6)]
        create_XML(row, no)
        no += 1
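On the namespace question: the commented-out iterparse line in the code above hints at reading the namespaces from the template itself rather than hardcoding them. A minimal sketch of that idea follows. It assumes a shortened in-memory copy of the template so that it runs without ead.xml; the names TEMPLATE and read_namespaces are made up for this illustration.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for the template; in practice pass the path 'ead.xml' instead.
TEMPLATE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid></eadid>
</eadheader>
</ead>"""


def read_namespaces(source):
    """Collect {prefix: uri} from the xmlns declarations found in an XML source."""
    return dict(node for _event, node in ET.iterparse(source, events=["start-ns"]))


ns = read_namespaces(io.BytesIO(TEMPLATE))
for prefix, uri in ns.items():
    # Registering keeps the original prefixes when the tree is serialized again.
    ET.register_namespace(prefix, uri)

root = ET.fromstring(TEMPLATE)
eadid = root.find(".//eadid", ns)  # the '' key is the default namespace (Python 3.8+)
eadid.text = "Church RG 095"
print(ns)
```

If some templates use a different namespace URI, this way the lookups follow whatever the file declares.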
Example output file for the first row:
<?xml version='1.0' encoding='us-ascii'?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid>Church RG 001</eadid>
<filedesc>
<titlestmt>
<titleproper>First Zion Lehigh Union Church (Alburtis, PA.)</titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn" />
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid>Church RG 001</unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn" />
</langmaterial>
<unittitle>First Zion Lehigh Union Church (Alburtis, PA.)</unittitle>
<unitdate normal="1843/2019" />
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc />
</archdesc>
</ead>
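The file above differs from the desired output in two small ways: the declaration says us-ascii, and empty elements are collapsed to the short form such as `<dsc />`. Both come from ElementTree's serialization defaults. Here is a sketch of the keyword arguments that change them, using a tiny stand-in document instead of the full template; `ElementTree.write` accepts the same `encoding`, `xml_declaration` and `short_empty_elements` keywords as `ET.tostring`.

```python
import xml.etree.ElementTree as ET

# Keep the default namespace as xmlns="..." instead of an ns0: prefix.
ET.register_namespace("", "urn:isbn:1-931666-22-9")

# Tiny stand-in for the filled template.
root = ET.fromstring(
    '<ead xmlns="urn:isbn:1-931666-22-9">'
    '<archdesc level="fonds"><dsc></dsc></archdesc>'
    "</ead>"
)

xml_bytes = ET.tostring(
    root,
    encoding="UTF-8",            # declared in the XML declaration
    xml_declaration=True,
    short_empty_elements=False,  # write <dsc></dsc> rather than <dsc />
)
print(xml_bytes.decode("utf-8"))
```

ElementTree writes the declaration with single quotes; XML parsers treat that the same as double quotes.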
My CSV looks like:
<titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>
Salem Reformed Church (Philadelphia, PA);Salem Reformed Church (Philadelphia, PA) ;Church RG 095 ;Church RG 095 ;1811;2003;1811/2003
First Zion Lehigh Union Church (Alburtis, PA.);First Zion Lehigh Union Church (Alburtis, PA.);Church RG 001;Church RG 001;1843;2019;1843/2019
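Some headers and values in this export carry trailing spaces (for example `<unittitle> ` and `Church RG 095 `), so it may be worth normalizing them while reading. The sketch below is one possible way to do that, not part of the code above: it reads the first sample row from an in-memory string and keys each value by its header name, so the lookups do not depend on column positions. SAMPLE and read_rows are hypothetical names.

```python
import csv
import io

# Header and first data row from the sample above, inlined so the sketch runs without Biblio.csv.
SAMPLE = (
    "<titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>\n"
    "Salem Reformed Church (Philadelphia, PA);Salem Reformed Church (Philadelphia, PA) ;"
    "Church RG 095 ;Church RG 095 ;1811;2003;1811/2003\n"
)


def read_rows(csvfile):
    """Yield one dict per data row, keyed by header text without '<', '>' or padding."""
    reader = csv.reader(csvfile, delimiter=";")
    header = [name.strip().strip("<>") for name in next(reader)]
    for fields in reader:
        yield dict(zip(header, (value.strip() for value in fields)))


rows = list(read_rows(io.StringIO(SAMPLE)))
print(rows[0]["eadid"], rows[0]["unitdate"])
```

With the real file, `read_rows(open("Biblio.csv", newline=""))` would take the place of the `io.StringIO` call.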