Parsing an XML document to retrieve data from multiple getElementsByTagName results

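I'm working on a project for the company I work for. They have a program that generates an XML file, and they want specific tag names extracted and turned into formatted output. For this I turned to Python, and I'm currently writing two programs.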

The first program successfully formats the raw data in the XML file into a properly indented tree structure.
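
For readers who want to see what such a formatter could look like, here is a minimal sketch, assuming the raw XML is available as a string. `pretty_xml` is a hypothetical helper name and not the program described above; it relies on minidom's `toprettyxml` and drops the whitespace-only lines that appear when the input was already partly indented.

```python
from xml.dom import minidom


def pretty_xml(raw: str, indent: str = "  ") -> str:
    """Return raw XML re-indented as a tree (hypothetical helper)."""
    doc = minidom.parseString(raw)
    pretty = doc.toprettyxml(indent=indent)
    # Whitespace-only text nodes in the input come out as blank lines; drop them.
    return "\n".join(line for line in pretty.splitlines() if line.strip())


# Tiny stand-in document using tag names from the question.
raw = "<ws_Worker><ws_Summary><ws_Name>John Doe</ws_Name></ws_Summary></ws_Worker>"
formatted = pretty_xml(raw)
print(formatted)
```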

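The second program is where I'm running into trouble. So far, using the minidom module, I've been able to produce output that prints a single line of seven variables, each taken from a specific tag in the XML file.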

The challenge is that every element tag I need to pull data from has multiple results across the full length of the document.

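The whole XML document is too large to post on this site and contains sensitive data, so I've had to truncate and modify a portion of it so that at least the hierarchy is visible.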

<ws_Worker>
  <ws_Summary>
    <ws_Employee_ID>555555</ws_Employee_ID>
    <ws_Name>John Doe</ws_Name>
  </ws_Summary>
  <ws_Eligibility ws_PriorValue="false">true</ws_Eligibility>
  <ws_Personal>
    <ws_Name_Data>
      <ws_Name_Type>Legal</ws_Name_Type>
      <ws_First_Name>John</ws_First_Name>
      <ws_Last_Name>Doe</ws_Last_Name>
      <ws_Formatted_Name>John Doe</ws_Formatted_Name>
      <ws_Reporting_Name>Doe, John</ws_Reporting_Name>
    </ws_Name_Data>
    <ws_Address_Data>
      <ws_Address_Type>WORK</ws_Address_Type>
      <ws_Address_Is_Public>true</ws_Address_Is_Public>
      <ws_Is_Primary>true</ws_Is_Primary>
      <ws_Address_Line_Data ws_Label="Address Line 1" ws_Type="ADDRESS_LINE_1">123 Sixth St.</ws_Address_Line_Data>
      <ws_Municipality>Baltimore</ws_Municipality>
      <ws_Region>Maryland</ws_Region>
      <ws_Postal_Code>12345</ws_Postal_Code>
      <ws_Country>US</ws_Country>
    </ws_Address_Data>
    <ws_Email_Data>
      <ws_Email_Type>WORK</ws_Email_Type>
      <ws_Email_Is_Public>true</ws_Email_Is_Public>
      <ws_Is_Primary>true</ws_Is_Primary>
      <ws_Email_Address ws_PriorValue="[email protected]">[email protected]</ws_Email_Address>
    </ws_Email_Data>
    <ws_Tobacco_Use>false</ws_Tobacco_Use>
  </ws_Personal>
  <ws_Status>
    <ws_Employee_Status>Active</ws_Employee_Status>
    <ws_Active>true</ws_Active>
    <ws_Active_Status_Date>2020-01-01</ws_Active_Status_Date>
    <ws_Hire_Date>2020-01-01</ws_Hire_Date>
    <ws_Original_Hire_Date>2015-01-01</ws_Original_Hire_Date>
    <ws_Hire_Reason>Hire_Employee_Rehire_Employee_After_13_Weeks</ws_Hire_Reason>
    <ws_Continuous_Service_Date>2020-01-01</ws_Continuous_Service_Date>
    <ws_First_Day_of_Work>2020-01-01</ws_First_Day_of_Work>
    <ws_Retirement_Eligibility_Date>2016-10-01</ws_Retirement_Eligibility_Date>
    <ws_Retired>false</ws_Retired>
    <ws_Seniority_Date>2015-10-01</ws_Seniority_Date>
    <ws_Terminated>false</ws_Terminated>
    <ws_Not_Eligible_for_Hire>false</ws_Not_Eligible_for_Hire>
    <ws_Regrettable_Termination>false</ws_Regrettable_Termination>
    <ws_Resignation_Date>2018-11-01</ws_Resignation_Date>
    <ws_Not_Returning>false</ws_Not_Returning>
    <ws_Return_Unknown>false</ws_Return_Unknown>
    <ws_Has_International_Assignment>false</ws_Has_International_Assignment>
    <ws_Home_Country>US</ws_Home_Country>
    <ws_Rehire>true</ws_Rehire>
  </ws_Status>
  <ws_Position>
    <ws_Operation>NONE</ws_Operation>
    <ws_Position_ID>12345</ws_Position_ID>
    <ws_Effective_Date>2020-01-10</ws_Effective_Date>
    <ws_Primary_Position>true</ws_Primary_Position>
    <ws_Position_Title>Driver</ws_Position_Title>
    <ws_Business_Title>Driver</ws_Business_Title>
    <ws_Worker_Type>Regular</ws_Worker_Type>
    <ws_Position_Time_Type>Part_time</ws_Position_Time_Type>
    <ws_Job_Exempt>false</ws_Job_Exempt>
    <ws_Scheduled_Weekly_Hours>29</ws_Scheduled_Weekly_Hours>
    <ws_Default_Weekly_Hours>40</ws_Default_Weekly_Hours>
    <ws_Full_Time_Equivalent_Percentage>72.5</ws_Full_Time_Equivalent_Percentage>
    <ws_Exclude_from_Headcount>false</ws_Exclude_from_Headcount>
    <ws_Pay_Rate_Type>Hourly</ws_Pay_Rate_Type>
    <ws_Workers_Compensation_Code>1234</ws_Workers_Compensation_Code>
    <ws_Job_Profile>DRIVER</ws_Job_Profile>
    <ws_Management_Level>Individual Contributor</ws_Management_Level>
    <ws_Job_Family>DRV</ws_Job_Family>
    <ws_Business_Site>LOC_TOWN</ws_Business_Site>
    <ws_Business_Site_Name>Local Town</ws_Business_Site_Name>
    <ws_Business_Site_Address_Line_Data ws_Label="Address Line 1" ws_Type="ADDRESS_LINE_1">1234 Sixth St.</ws_Business_Site_Address_Line_Data>
    <ws_Business_Site_Municipality>Baltimore</ws_Business_Site_Municipality>
    <ws_Business_Site_Region>Maryland</ws_Business_Site_Region>
    <ws_Business_Site_Postal_Code>12345</ws_Business_Site_Postal_Code>
    <ws_Business_Site_Country>US</ws_Business_Site_Country>
    <ws_Supervisor>
      <ws_Operation>NONE</ws_Operation>
      <ws_Supervisor_ID>1234567</ws_Supervisor_ID>
      <ws_Supervisor_Name>Little Mac</ws_Supervisor_Name>
    </ws_Supervisor>
  </ws_Position>
  <ws_Additional_Information>
    <ws_WD_Username>John.Doe</ws_WD_Username>
    <ws_Last_4_SSN_Digits>1234</ws_Last_4_SSN_Digits>
  </ws_Additional_Information>
</ws_Worker>

Keep in mind that there are 36 other elements in this file.

Here is my program so far:

from xml.dom import minidom

xmldoc = minidom.parse("//tocp-fs1/mydocs/mantonishak/Documents/Python/The_Hard_Way/Out.xml")

outworkers = xmldoc.getElementsByTagName("ws_Worker")[0]
# Knowing your hierarchy is important.  ws_Worker is at the top.  [0] asks for the first item in the list.
outsummaries = outworkers.getElementsByTagName("ws_Summary")
outpersonals = outworkers.getElementsByTagName("ws_Personal")
outpositions = outworkers.getElementsByTagName("ws_Position")
outadditionals = outworkers.getElementsByTagName("ws_Additional_Information")

for outpersonal in outpersonals:
    desc = outpersonal.getElementsByTagName("ws_Formatted_Name")[0].firstChild.data
    # displays the user's Full Name
    for outsummary in outsummaries:
        desc2 = outsummary.getElementsByTagName("ws_Employee_ID")[0].firstChild.data
        # displays the user's Workday ID
    for location in outpositions:
        desc3 = location.getElementsByTagName("ws_Business_Site_Name")[0].firstChild.data
        # displays the user's current work location (Store Name)
    for title in outpositions:
        desc4 = title.getElementsByTagName("ws_Position_Title")[0].firstChild.data
        # displays the user's current title
    for email in outpersonals:
        desc5 = email.getElementsByTagName("ws_Email_Address")[0].firstChild.data
        lst = desc5.split("@")
        atsign = (lst[1])
        # This splits the ws_Email_Address value at the @ sign, removes it, and displays the string
        # to the right of the @ sign (which is the domain)
    for firstletter in outpersonals:
        desc6 = firstletter.getElementsByTagName("ws_First_Name")[0].firstChild.data
        firstletter = desc6[0]
        # This grabs the first letter of the ws_First_Name value so it can be combined later with
        # the ws_Last_Name value to create the username
    for lastname in outpersonals:
        desc7 = lastname.getElementsByTagName("ws_Last_Name")[0].firstChild.data
        username = (firstletter + desc7)
        # grabs the last name and combines with the first letter of the first name
        # this creates the username
    for ssn in outadditionals:
        desc8 = ssn.getElementsByTagName("ws_Last_4_SSN_Digits")[0].firstChild.data
        firstpass = desc6[0:2]
        lastpass = desc7[-2:]
        password = (firstpass + desc8 + lastpass)
        # this takes the first two chars of the ws_First_Name adds them as a string with the
        # ws_Last_4_SSN_Digits and the last two chars of ws_Last_Name.
    print("Full Name: %s, Employee ID: %s, Location: %s, Title: %s, Domain: %s, Username: %s, Password: %s" %
            (desc, desc2, desc3, desc4, atsign, username.lower(), password.lower()))
            # Prints the output on a single horizontal line.  The .lower() calls on
            # username and password convert all characters in those strings to lowercase.

My output looks like this:

Full Name: John Doe, Employee ID: 1234567, Location: Local Town, Title: Driver, Domain: company.com, Username: jdoe, Password: jo1234oe

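So line 5 is where I think the magic has to happen. The integer [0] pulls child tags only from within the first element. If I change the integer to [1], it pulls from the second, [2] pulls from the third, and so on.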

How can I build a loop that changes that integer and prints the output for every element across the whole file?
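
One way to approach this with minidom, sketched under some assumptions: `getElementsByTagName` returns a list-like NodeList, so dropping the `[0]` and iterating over the result visits every `ws_Worker` in turn. The `ws_Workers` root element, the second worker's values, and the helper names `first_text` and `worker_rows` are invented for illustration and do not come from the real file.

```python
from xml.dom import minidom

# Invented two-worker sample; the ws_Workers wrapper and the second worker are assumptions.
SAMPLE = """<ws_Workers>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>555555</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data>
        <ws_First_Name>John</ws_First_Name>
        <ws_Last_Name>Doe</ws_Last_Name>
        <ws_Formatted_Name>John Doe</ws_Formatted_Name>
      </ws_Name_Data>
    </ws_Personal>
    <ws_Additional_Information>
      <ws_Last_4_SSN_Digits>1234</ws_Last_4_SSN_Digits>
    </ws_Additional_Information>
  </ws_Worker>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>666666</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data>
        <ws_First_Name>Jane</ws_First_Name>
        <ws_Last_Name>Roe</ws_Last_Name>
        <ws_Formatted_Name>Jane Roe</ws_Formatted_Name>
      </ws_Name_Data>
    </ws_Personal>
    <ws_Additional_Information>
      <ws_Last_4_SSN_Digits>5678</ws_Last_4_SSN_Digits>
    </ws_Additional_Information>
  </ws_Worker>
</ws_Workers>"""


def first_text(parent, tag, default=""):
    """Text of the first <tag> descendant of parent, or default if missing or empty."""
    nodes = parent.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        return nodes[0].firstChild.data
    return default


def worker_rows(doc):
    """Build one dict per ws_Worker instead of reading only index [0]."""
    rows = []
    for worker in doc.getElementsByTagName("ws_Worker"):
        first = first_text(worker, "ws_First_Name")
        last = first_text(worker, "ws_Last_Name")
        ssn = first_text(worker, "ws_Last_4_SSN_Digits")
        rows.append({
            "name": first_text(worker, "ws_Formatted_Name"),
            "employee_id": first_text(worker, "ws_Employee_ID"),
            # Same derivations as in the question's program.
            "username": (first[:1] + last).lower(),
            "password": (first[:2] + ssn + last[-2:]).lower(),
        })
    return rows


rows = worker_rows(minidom.parseString(SAMPLE))
for row in rows:
    print(f"Full Name: {row['name']}, Employee ID: {row['employee_id']}, "
          f"Username: {row['username']}, Password: {row['password']}")
```

Because every lookup is scoped to the current `worker` node, the values on each printed line belong to the same person, and a missing tag yields an empty string rather than an IndexError.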

python xml loops getelementsbytagname
1 answer

Use lxml and XPath:
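
A minimal sketch of the path-based idea, with one deliberate substitution: it uses the standard library's `xml.etree.ElementTree`, whose `findall` and `findtext` accept a limited XPath subset, so it runs without installing anything. With lxml, `lxml.etree` provides the same `findall`/`findtext` calls plus an `xpath()` method for full XPath 1.0. The `ws_Workers` root, the second worker, and the name `worker_summaries` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the tag names are taken from the question.
SAMPLE = """<ws_Workers>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>555555</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data><ws_Formatted_Name>John Doe</ws_Formatted_Name></ws_Name_Data>
    </ws_Personal>
    <ws_Position><ws_Position_Title>Driver</ws_Position_Title></ws_Position>
  </ws_Worker>
  <ws_Worker>
    <ws_Summary><ws_Employee_ID>666666</ws_Employee_ID></ws_Summary>
    <ws_Personal>
      <ws_Name_Data><ws_Formatted_Name>Jane Roe</ws_Formatted_Name></ws_Name_Data>
    </ws_Personal>
    <ws_Position><ws_Position_Title>Clerk</ws_Position_Title></ws_Position>
  </ws_Worker>
</ws_Workers>"""


def worker_summaries(root):
    """Return (name, employee ID, title) for every ws_Worker under root."""
    summaries = []
    for worker in root.findall(".//ws_Worker"):
        # Paths are relative to the current worker, so values never mix between workers.
        name = worker.findtext(".//ws_Formatted_Name", default="")
        employee_id = worker.findtext("ws_Summary/ws_Employee_ID", default="")
        title = worker.findtext("ws_Position/ws_Position_Title", default="")
        summaries.append((name, employee_id, title))
    return summaries


summaries = worker_summaries(ET.fromstring(SAMPLE))
for name, employee_id, title in summaries:
    print(f"Full Name: {name}, Employee ID: {employee_id}, Title: {title}")
```

Note that `.//ws_Worker` matches descendants of the root only, so this sketch assumes the workers sit beneath a wrapper element rather than `ws_Worker` being the document root itself.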
