Parsing XML to CSV in Python


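I am trying to write a Python script that parses an XML file and generates one CSV file for each table in the XML. Each CSV should list that table's attributes. In addition, I want to create a CSV file that represents the relationships between these tables.

Here is my code: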

import xml.etree.ElementTree as ET
import csv

def extract_tables_and_attributes(xml_file):
    parser = ET.XMLParser(encoding="windows-1252")
    tree = ET.parse(xml_file, parser=parser)
    root = tree.getroot()

    tables = root.findall(".//{http://www.omg.org/spec/UML/20090901}Class")
    
    table_data = []
    
    for table in tables:
        table_name = table.find("{http://www.omg.org/spec/UML/20090901}name").text
        attributes = table.findall(".//{http://www.omg.org/spec/UML/20090901}Property")

        table_attributes = []
        for attr in attributes:
            attr_name = attr.find("{http://www.omg.org/spec/UML/20090901}name").text
            attr_type = attr.find("{http://www.omg.org/spec/UML/20090901}type").text
            table_attributes.append([attr_name, attr_type])

        table_data.append((table_name, table_attributes))
    
    return table_data

def export_to_csv(table_data):
    for table_name, attributes in table_data:
        csv_file_name = f"{table_name}.csv"
        with open(csv_file_name, 'w', newline='') as csvfile:
            csv_writer = csv.writer(csvfile)
            csv_writer.writerow(['Attribute Name', 'Attribute Type'])

            for attr_name, attr_type in attributes:
                csv_writer.writerow([attr_name, attr_type])

if __name__ == "__main__":
    xml_file = r"Gemeentelijk Gegevensmodel XMI2.1.2.xml"
    table_data = extract_tables_and_attributes(xml_file)
    export_to_csv(table_data)

However, I get the following error:


Traceback (most recent call last):
  File "ggm to csv.py", line 39, in <module>
    table_data = extract_tables_and_attributes(xml_file)
  File "ggm to csv.py", line 6, in extract_tables_and_attributes
    tree = ET.parse(xml_file, parser=parser)
  File "xml\etree\ElementTree.py", line 1203, in parse
    tree.parse(source, parser)
  File "xml\etree\ElementTree.py", line 571, in parse
    parser.feed(data)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 71765, column 176

I am using the following XML, and it does not appear to have any well-formedness problems. It would be great if someone could help me solve this.

https://github.com/Gegevensmodel%20XMI2.1.2.xml

I tried running the script against several different exports from Enterprise Architect to XML (UML), but I keep hitting the same error on the same line.
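Because the error message names a line and column, one way to narrow this down is to look at the raw bytes at that position. The following is only a minimal sketch: `locate_parse_error` and the `sample` input are hypothetical names made up for illustration, it uses the default ElementTree parser rather than the `windows-1252` parser from the script, and it assumes the reported column is a zero-based offset into the line, which may not hold exactly for multi-byte characters.

```python
import xml.etree.ElementTree as ET

def locate_parse_error(data, width=40):
    """Return (line, column, nearby bytes) for the first parse error, or None."""
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        # err.position is the (line, column) pair shown in the error message
        line_no, column = err.position
        raw = data.split(b"\n")[line_no - 1]
        # Slice a generous window, since the column may not be an exact byte offset
        return line_no, column, raw[max(column - width, 0):column + width]
    return None

# Hypothetical input: a control character that XML 1.0 does not allow.
sample = b"<root>\n<a>bad \x01 char</a>\n</root>"
print(locate_parse_error(sample))
```

For the real export, the same function could be called with the file's contents read in binary mode, for example `locate_parse_error(open(xml_file, "rb").read())`. Printing the bytes as a `bytes` object makes control characters and stray encoding sequences visible as escapes.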

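Edit:

The desired result looks something like this: a table for one class (here "PAND") listing its attributes and their types. This is a single table, but I expect many CSV files, each with its own table: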

Attribute Name,Attribute Type
Bruto inhoud pand,EAJava_N6
数据开始 geldigheid pand,EAJava_DATUM
基准面 geldigheid pand,EAJava_DATUM
GeometriePunt,EAJava_GM_Point
Hoogste bouwlaag pand,EAJava_N3
标识 BGTND,EAJava_NEN3610ID
Ind 计划对象,EAJava_INDIC
指示几何图形,EAJava_INDIC
内胜几何 bovenaanzicht,EAJava_GM_Object
获胜几何 maaiveld,EAJava_GM_Object
Laagste bouwlaag pand,EAJava_N3
标签 数字和uidingreeks,EAJava_C74E1553_32AE_4fd8_9796_00C6E1C51A11
Lod1 几何潘德,EAJava_GM_Object
Lod2 几何潘德,EAJava_GM_Object
Lod3 几何潘德,EAJava_GM_Object
Oorspronkelijk bouwjaar pand,EAJava_JAAR
Oppervlakte pand,EAJava_N6
Pandgeometrie bovenaanzicht,EAJava_GM_Surface
Pandgeometrie maaild,EAJava_GM_MultiSurface
Pandidentificatie,EAJava_AB8B30D0_FD1F_4c44_9396_BB05389EA20B
Pandstatus,EAJava_E2CC5DFC_C264_4c21_8E47_F551958E1C17
相关 hoogteligging pand,EAJava_N2
状态 voortgang 鲍,EAJava_8C49F097_6D95_4406_B3B7_58AC102B6FD2

python xml csv parsing elementtree
1 Answer

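It is not entirely clear to me what you are searching for. Can you give a simple example of what your search pattern looks like?

With lxml you can parse this file directly from GitHub: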

from urllib.request import urlopen
from lxml import etree

import psutil
import time
time_start = time.time()

url = "https://raw.githubusercontent.com/Gemeente-Delft/Gemeentelijk-Gegevensmodel/master/"
file = "Gemeentelijk%20Gegevensmodel%20XMI2.1.2.xml"
fd = url+file

f = urlopen(fd)

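# recover=True asks lxml to try to continue past malformed input
for event, elem in etree.iterparse(f, events=['start-ns', 'end'], recover=True):
    if event == "start-ns":
        # elem is a (prefix, uri) tuple here; uncomment to list the namespaces
        #print(elem[0])
        continue
    # Classes are packagedElement nodes whose xmi:type attribute is "uml:Class"
    if elem.tag == "packagedElement" and elem.get('{http://schema.omg.org/spec/XMI/2.1}type') == 'uml:Class':
        for prob in elem.findall("./ownedAttribute"):
            print(prob.get('name'))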

print("RAM:")
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))
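
To get from the printed attribute names to one CSV per class, the same selection can feed the `csv` module. The following is a rough sketch, not a tested solution for the real file. It assumes each `ownedAttribute` carries its type as an `xmi:idref` on a child `type` element, which would match the `EAJava_...` values in the desired output but is not confirmed here. `SAMPLE`, `collect_classes` and `rows_to_csv` are hypothetical names, and the miniature document is made up for illustration. It uses the standard-library ElementTree on well-formed input; for the real export, the root would have to come from a recovering lxml parser as shown above.

```python
import csv
import io
import xml.etree.ElementTree as ET

XMI_NS = "http://schema.omg.org/spec/XMI/2.1"  # namespace used in the code above

# Hypothetical miniature document, shaped like the layout assumed here.
SAMPLE = f"""<xmi:XMI xmlns:xmi="{XMI_NS}" xmlns:uml="http://schema.omg.org/spec/UML/2.1">
  <uml:Model name="demo">
    <packagedElement xmi:type="uml:Class" name="PAND">
      <ownedAttribute name="Oppervlakte pand"><type xmi:idref="EAJava_N6"/></ownedAttribute>
      <ownedAttribute name="Pandstatus"/>
    </packagedElement>
  </uml:Model>
</xmi:XMI>"""

def collect_classes(root):
    """Map each uml:Class name to a list of (attribute name, type reference)."""
    tables = {}
    for elem in root.iter("packagedElement"):
        if elem.get(f"{{{XMI_NS}}}type") != "uml:Class":
            continue
        rows = []
        for attr in elem.findall("./ownedAttribute"):
            type_elem = attr.find("./type")
            # Assumption: the type is stored as xmi:idref on a child <type> element
            type_ref = type_elem.get(f"{{{XMI_NS}}}idref", "") if type_elem is not None else ""
            rows.append((attr.get("name", ""), type_ref))
        # Classes that share a name overwrite each other in this sketch
        tables[elem.get("name", "")] = rows
    return tables

def rows_to_csv(rows):
    """Render one class as CSV text with the header used in the question."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Attribute Name", "Attribute Type"])
    writer.writerows(rows)
    return buffer.getvalue()

tables = collect_classes(ET.fromstring(SAMPLE))
for name, rows in tables.items():
    print(f"--- {name}.csv ---")
    print(rows_to_csv(rows), end="")
```

Writing to disk instead of printing only needs `open(f"{name}.csv", "w", newline="")` in place of the `StringIO` buffer, as in the question's `export_to_csv`.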