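I have the following xml file and want to import data from an Excel spreadsheet and place it between certain elements (for example eadid and titleproper). I tried the Python code below, but it generates an xml file that does not contain the full schema. I would also like to save each row of the Excel file to a separate xml file.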
<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid></eadid>
<filedesc>
<titlestmt>
<titleproper></titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn"></language>
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid></unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn"></language>
</langmaterial>
<unittitle></unittitle>
<unitdate normal=""></unitdate>
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc>
</dsc>
</archdesc>
</ead>
This is the code I tried in Python:
import pandas as pd
# pip install openpyxl  (shell command, not Python; pandas needs openpyxl to read .xlsx)
from lxml import etree as et
import xml.etree.ElementTree as Xet

tree = et.parse('test_resource.xml')
root = tree.getroot()
raw_data = pd.read_excel(r'/Users/smeyerkukan/Desktop/ArchivesSpace/Python Coding/aspace.xlsx')
tree = et.parse('test_resource.xml')
root = tree.getroot()

for row in raw_data.iterrows():
    root_tags = et.SubElement(root, 'ExportData')
    Column_heading_1 = et.SubElement(root_tags, 'titleproper')
    Column_heading_2 = et.SubElement(root_tags, 'unittitle')
    Column_heading_3 = et.SubElement(root_tags, 'eadid')
    Column_heading_4 = et.SubElement(root_tags, 'unitid')
    Column_heading_7 = et.SubElement(root_tags, 'unitdate')
    Column_heading_1.text = str(row[1]['<titleproper>'])
    Column_heading_2.text = str(row[1]['<unittitle>'])
    Column_heading_3.text = str(row[1]['<eadid>'])
    Column_heading_4.text = str(row[1]['<unitid>'])
    Column_heading_7.text = str(row[1]['<unitdate>'])

tree = et.ElementTree(root)
et.indent(tree, space="\t", level=0)
tree.write('output.xml', encoding="utf-8")
This is the output I want:
<?xml version="1.0" encoding="UTF-8"?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid>Church RG 095</eadid>
<filedesc>
<titlestmt>
<titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn"></language>
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid>Church RG 095</unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn"></language>
</langmaterial>
<unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
<unitdate normal="1811/2003"></unitdate>
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc>
</dsc>
</archdesc>
</ead>
This is the output I get:
<ExportData>
<titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
<unittitle>Salem Reformed Church (Philadelphia, PA)</unittitle>
<eadid>Church RG 095</eadid>
<unitid>Church RG 095</unitid>
<unitdate>1811/2003</unitdate>
</ExportData>
<ExportData>
<titleproper>First Zion Lehigh Union Church (Alburtis, Pa.)</titleproper>
<unittitle>First Zion Lehigh Union Church (Alburtis, Pa.)</unittitle>
<eadid>Church RG 001</eadid>
<unitid>Church RG 001</unitid>
<unitdate>1843/2019</unitdate>
</ExportData>
<ExportData>
<titleproper>Christ Reformed Church (Allentown, PA)</titleproper>
<unittitle>Christ Reformed Church (Allentown, PA)</unittitle>
<eadid>Church RG 002</eadid>
<unitid>Church RG 002</unitid>
<unitdate>1876/1982</unitdate>
</ExportData>
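If you can save the Excel sheet as a csv file, you don't need pandas. I suggest using a function that fills the empty xml template and saves it to disk, with the row number as the file ID: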
import xml.etree.ElementTree as ET
import csv

def create_XML(row, no):
    # Template file: ead.xml
    # Register the template's namespaces so they are written back unchanged.
    # This could be done once outside the function instead of for every row.
    ns = dict(node for event, node in ET.iterparse('ead.xml', events=['start-ns']))
    for prefix, uri in ns.items():
        ET.register_namespace(prefix, uri)

    tree = ET.parse('ead.xml')
    root = tree.getroot()

    # The '' key in ns maps unprefixed names to the default namespace (Python 3.8+).
    titleproper = root.find('.//titleproper', ns)
    titleproper.text = row[0]
    unittitle = root.find('.//unittitle', ns)
    unittitle.text = row[1]
    eadid = root.find('.//eadid', ns)
    eadid.text = row[2]
    unitid = root.find('.//unitid', ns)
    unitid.text = row[3]
    unitdate = root.find('.//unitdate', ns)
    unitdate.text = row[4]

    ET.indent(root, space=' ')
    tree.write(f"{no}_book.xml", xml_declaration=True)
    print(f"{no}_book.xml written!")

# Export the Excel sheet to csv with ';' as the delimiter, e.g.:
# <titleproper>;<unittitle> ;<eadid> ;<unitid> ;BeginDate;EndDate;<unitdate>

# Row number, used in the file name of each output xml
no = 1
with open("Biblio.csv", newline='') as csvfile:
    book = csv.reader(csvfile, delimiter=';')
    next(book, None)  # skip the header row
    for line in book:
        # Excel column0:<titleproper>; column1:<unittitle>; column2:<eadid>; column3:<unitid>; column6:<unitdate>
        row = [line[0], line[1], line[2], line[3], line[6]]
        create_XML(row, no)
        no += 1
Output:
<?xml version='1.0' encoding='us-ascii'?>
<ead xmlns="urn:isbn:1-931666-22-9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:isbn:1-931666-22-9 http://www.loc.gov/ead/ead.xsd">
<eadheader>
<eadid>Church RG 095 </eadid>
<filedesc>
<titlestmt>
<titleproper>Salem Reformed Church (Philadelphia, PA)</titleproper>
</titlestmt>
</filedesc>
<profiledesc>
<langusage>
<language langcode="eng" scriptcode="Latn" />
</langusage>
</profiledesc>
</eadheader>
<archdesc level="fonds">
<did>
<unitid>Church RG 095 </unitid>
<langmaterial>
<language langcode="eng" scriptcode="Latn" />
</langmaterial>
<unittitle>Salem Reformed Church (Philadelphia, PA) </unittitle>
<unitdate normal="">1811/2003 </unitdate>
<physdesc>
<extent>1 Cubic Feet</extent>
</physdesc>
<langmaterial>
<language langcode="eng" scriptcode="Latn">English</language>
</langmaterial>
</did>
<dsc />
</archdesc>
</ead>
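The output above still differs from the requested one in three small ways: the values keep the trailing blanks from the csv, the date range ends up as the text of unitdate instead of in its normal attribute, and the declaration says us-ascii. A possible variation is sketched below; it is not a drop-in replacement. It assumes the csv uses the header row quoted in the comment above, that every row has all seven fields, and that Python 3.9+ is available for ET.indent. The names fill_template, export_rows and TEXT_FIELDS are made up for this example.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

EAD_NS = "urn:isbn:1-931666-22-9"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

# Register the prefixes once, so <ead> keeps its default namespace on output
# instead of being written as <ns0:ead>.
ET.register_namespace("", EAD_NS)
ET.register_namespace("xsi", XSI_NS)

# Assumed csv header -> template element whose text is filled in.
TEXT_FIELDS = {
    "<titleproper>": "titleproper",
    "<unittitle>": "unittitle",
    "<eadid>": "eadid",
    "<unitid>": "unitid",
}


def fill_template(template_path, record):
    """Parse a fresh copy of the template and fill in one record."""
    tree = ET.parse(template_path)
    ns = {"ead": EAD_NS}
    for header, tag in TEXT_FIELDS.items():
        tree.find(f".//ead:{tag}", ns).text = record[header]
    # The requested output carries the date range in the "normal" attribute.
    tree.find(".//ead:unitdate", ns).set("normal", record["<unitdate>"])
    return tree


def export_rows(template_path, csv_path, out_dir):
    """Write one xml file per csv row and return the paths written."""
    out_dir = Path(out_dir)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile, delimiter=";")
        for no, raw in enumerate(reader, start=1):
            # Strip stray blanks from both the headers and the values.
            record = {key.strip(): value.strip() for key, value in raw.items()}
            tree = fill_template(template_path, record)
            ET.indent(tree, space="  ")
            target = out_dir / f"{no}_book.xml"
            tree.write(target, encoding="utf-8", xml_declaration=True)
            written.append(target)
    return written
```

Calling export_rows("ead.xml", "Biblio.csv", ".") would write 1_book.xml, 2_book.xml and so on into the current directory. Looking columns up by header name rather than by position means the BeginDate and EndDate columns can stay in the csv without shifting any index.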