Parsing multiple XML files in Python

Question · Votes: 2 · Answers: 2

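I've run into a problem here. I want to parse multiple XML files that share the same structure. I have already managed to collect the locations of all the files and save them into three different lists, because there are three different types of XML structure. Now I want to create three functions (one per list) that loop through the list and parse the information I need. Somehow I can't get it to work. Can anyone here give me a hint on how to do this?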

import os
import glob
import xml.etree.ElementTree as ET
import fnmatch
import re
import sys


#### Get the location of each XML file and save them into a list ####

all_xml_list = []

def locate(pattern,root=os.curdir):
    for path, dirs, files in os.walk(os.path.abspath(root)):
        for filename in fnmatch.filter(files,pattern):
            yield os.path.join(path,filename)

for files in locate('*.xml',r'C:\Users\Lars\Documents\XML-Files'):
    all_xml_list.append(files)


#### Create lists by GameDay Events ####


xml_GameDay_Player   = [x for x in all_xml_list if 'Player' in x]
xml_GameDay_Team     = [x for x in all_xml_list if 'Team' in x]
xml_GameDay_Match    = [x for x in all_xml_list if 'Match' in x]

The XML files look like this:

<sports-content xmlns:imp="url">
  <sports-metadata date-time="20160912T000000+0200" doc-id="sports_event_" publisher="somepublisher" language="en_EN" document-class="player-statistics">
    <sports-title>player-statistics-165483</sports-title>
  </sports-metadata>
  <sports-event>
    <event-metadata id="E_165483" event-key="165483" event-status="post-event" start-date-time="20160827T183000+0200" start-weekday="saturday" heat-number="1" site-attendance="52183" />
    <team>
      <team-metadata id="O_17" team-key="17">
        <name full="TeamName" nickname="NicknameoftheTeam" imp:dfl-3-letter-code="NOT" official-3-letter-code="" />
      </team-metadata>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" full="Max Mustermann" nickname="Mäxchen" imp:extensive="Name" />
        </player-metadata>
        <player-stats stats-coverage="standard" date-coverage-type="event" minutes-played="90" score="0">
          <rating rating-type="standard" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="5.6" imp:rating-value-mid-fielder="5.8" imp:rating-value-forward="5.0" />
          <rating rating-type="grade" rating-value="2.2" />
          <rating rating-type="index" imp:rating-value-goalie="7.6" imp:rating-value-defenseman="3.7" imp:rating-value-mid-fielder="2.5" imp:rating-value-forward="1.2" />
          <rating rating-type="bemeister" rating-value="16.04086" />
          <player-stats-soccer imp:duels-won="1" imp:duels-won-ground="0" imp:duels-won-header="1" imp:duels-lost-ground="0" imp:duels-lost-header="0" imp:duels-lost="0" imp:duels-won-percentage="100" imp:passes-completed="28" imp:passes-failed="4" imp:passes-completions-percentage="87.5" imp:passes-failed-percentage="12.5" imp:passes="32" imp:passes-short-total="22" imp:balls-touched="50" imp:tracking-distance="5579.80" imp:tracking-average-speed="3.41" imp:tracking-max-speed="23.49" imp:tracking-sprints="0" imp:tracking-sprints-distance="0.00" imp:tracking-fast-runs="3" imp:tracking-fast-runs-distance="37.08" imp:tracking-offensive-runs="0" imp:tracking-offensive-runs-distance="0.00" dfl-distance="5579.80" dfl-average-speed="3.41" dfl-max-speed="23.49">
            <stats-soccer-defensive saves="5" imp:catches-punches-crosses="3" imp:catches-punches-corners="0" goals-against-total="1" imp:penalty-saves="0" imp:clear-cut-chance="0" />
            <stats-soccer-offensive shots-total="0" shots-on-goal-total="0" imp:shots-off-post="0" offsides="0" corner-kicks="0" imp:crosses="0" assists-total="0" imp:shot-assists="0" imp:freekicks="3" imp:miss-chance="0" imp:throw-in="0" imp:punt="2" shots-penalty-shot-scored="0" shots-penalty-shot-missed="0" dfl-assists-total="0" imp:shots-total-outside-box="0" imp:shots-total-inside-box="0" imp:shots-foot-inside-box="0" imp:shots-foot-outside-box="0" imp:shots-total-header="0" />
            <stats-soccer-foul fouls-commited="0" fouls-suffered="0" imp:yellow-red-cards="0" imp:red-cards="0" imp:yellow-cards="0" penalty-caused="0" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>

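I want to extract everything inside the "player-metadata" tag, the "player-stats" tag (the one carrying stats-coverage) and the "player-stats-soccer" tag.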

python-3.x function for-loop xml-parsing
2 Answers

2 votes

Improving on @Gnudiff's answer, here is a more resilient approach:

import os
from glob import glob
from lxml import etree

xml_GameDay = {
    'Player': [],
    'Team': [],
    'Match': [],
}

# sort all files into the right buckets
for filename in glob(r'C:\Users\Lars\Documents\XML-Files\*.xml'):
    for key in xml_GameDay.keys():
        if key in os.path.basename(filename):
            xml_GameDay[key].append(filename)
            break

def select_first(context, path):
    result = context.xpath(path)
    if len(result):
        return result[0]
    return None

# extract data from Player files
for filename in xml_GameDay['Player']:
    tree = etree.parse(filename)

    for player in tree.xpath('.//player'):        
        player_data = {
            'key': select_first(player, './player-metadata/@player-key'),
            'lastname': select_first(player, './player-metadata/name/@last'),
            'firstname': select_first(player, './player-metadata/name/@first'),
            'nickname': select_first(player, './player-metadata/name/@nickname'),
        }
        print(player_data)
        # ...
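
The loop above only picks out the name attributes. As a rough sketch of how the other pieces the question asks about could be collected, the following gathers the attributes of player-metadata, player-stats and player-stats-soccer into one dictionary per player. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The helper names (local_name, attributes_of, extract_players), the shortened SAMPLE document and the layout of the result dictionary are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for one of the Player files.
SAMPLE = """<sports-content xmlns:imp="url">
  <sports-event>
    <team>
      <player>
        <player-metadata player-key="33201" uniform-number="1">
          <name first="Max" last="Mustermann" />
        </player-metadata>
        <player-stats stats-coverage="standard" minutes-played="90">
          <player-stats-soccer imp:passes="32" dfl-distance="5579.80">
            <stats-soccer-defensive saves="5" />
          </player-stats-soccer>
        </player-stats>
      </player>
    </team>
  </sports-event>
</sports-content>"""


def local_name(qualified):
    """Strip the '{namespace}' prefix ElementTree puts on namespaced names."""
    return qualified.rsplit('}', 1)[-1]


def attributes_of(element):
    """Return an element's attributes as a plain dict ({} if it is missing)."""
    if element is None:
        return {}
    return {local_name(key): value for key, value in element.attrib.items()}


def extract_players(root):
    """Collect one dict per <player>, grouped by the element the data came from."""
    players = []
    for player in root.iter('player'):
        soccer = player.find('player-stats/player-stats-soccer')
        record = {
            'metadata': attributes_of(player.find('player-metadata')),
            'name': attributes_of(player.find('player-metadata/name')),
            'stats': attributes_of(player.find('player-stats')),
            'soccer': attributes_of(soccer),
        }
        # The defensive / offensive / foul blocks are children of player-stats-soccer.
        for child in (soccer if soccer is not None else []):
            record[child.tag] = attributes_of(child)
        players.append(record)
    return players


# For a real file, use ET.parse(filename).getroot() instead of ET.fromstring().
for record in extract_players(ET.fromstring(SAMPLE)):
    print(record['name'].get('last'), record['soccer'].get('passes'))
```

ElementTree reports a prefixed attribute such as imp:passes as {url}passes, which is why local_name strips the braces part. Dropping the namespace keeps the keys readable, but it would merge two attributes that differ only by prefix, so keep the qualified names if that can happen in your files.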

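XML files can come in various byte encodings, and they can be prefixed with an XML declaration that states the encoding of the rest of the file.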

<?xml version="1.0" encoding="UTF-8"?>

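UTF-8 is a common encoding for XML files (it is also the default), but in practice it can be anything. It is impossible to predict, and hard-coding your program to expect a certain encoding is very bad practice.

XML parsers are designed to handle this peculiarity transparently, so you don't have to worry about it, unless you do it wrong.

This is a good example of doing it wrong: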

# BAD CODE, DO NOT USE
def file_get_contents(filename):
    with open(filename) as f:
        return f.read()

tree = etree.XML(file_get_contents('some_filename.xml'))

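What happens here is:

  1. Python opens filename as a text file f, decoding it with the platform's default encoding, which is not necessarily the encoding of the file
  2. f.read() returns a string
  3. etree.XML() parses that string and creates a DOM object tree

That doesn't sound so wrong, does it? But if the XML looks like this: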

<?xml version="1.0" encoding="UTF-8"?>
<Player nickname="Mäxchen">...</Player>

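then the DOM you end up with is:

Player
    @nickname="MÃ¤xchen"

You have just destroyed the data. And unless the XML contains an "extended" character like ä, you would not even notice that this approach is broken. That makes it easy to overlook.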

There is only one correct way to open an XML file (and it is also simpler than the code above): give the file name to the parser.

tree = etree.parse('some_filename.xml')

This way the parser can figure out the file encoding before it reads the data, and you don't have to care about those details.
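
The difference between the two approaches can be reproduced with a small experiment. This is only a sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, it writes a temporary UTF-8 file, and it passes encoding='cp1252' explicitly to imitate a typical Western European Windows default. The function names are made up for illustration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A UTF-8 encoded document whose declaration says so.
XML_BYTES = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<Player nickname="Mäxchen">...</Player>'
).encode('utf-8')


def write_sample():
    """Write the sample bytes to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix='.xml')
    with os.fdopen(fd, 'wb') as f:
        f.write(XML_BYTES)
    return path


def nickname_via_text_read(path, assumed_encoding='cp1252'):
    """The wrong way: decode the file ourselves, then parse the string."""
    with open(path, encoding=assumed_encoding) as f:
        return ET.fromstring(f.read()).get('nickname')


def nickname_via_parser(path):
    """The right way: let the parser open the file and detect the encoding."""
    return ET.parse(path).getroot().get('nickname')


path = write_sample()
try:
    wrong = nickname_via_text_read(path)   # the UTF-8 bytes were decoded as cp1252
    right = nickname_via_parser(path)      # the declaration was honoured
finally:
    os.remove(path)

# wrong now holds the mangled 'MÃ¤xchen', right holds the intended 'Mäxchen'.
```

Reading the file in binary mode and handing the bytes to the parser would also keep the data intact, because the parser still gets to see the declaration and decode the bytes itself.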


0 votes

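This is not a complete solution for your particular case, because that is quite a task and I have no keyboard; I'm working on a tablet.

In general, you can do this in several ways, depending on whether you really need all the data or only want to extract a specific subset, and on whether you know all possible structures in advance.

For example, one way: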

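from lxml import etree

player_data = []
for filename in xml_GameDay_Player:
    tree = etree.XML(file_get_contents(filename))
    for player in tree.xpath('.//player'):
        row = {}
        # an attribute XPath returns a list of strings (empty if nothing matches)
        row['player'] = player.xpath('./player-metadata/name/@last')
        for plrdata in player.xpath('.//player-stats'):
            # do stuff with player data, e.g. keep a copy of its attributes
            row['stats'] = dict(plrdata.attrib)
        player_data.append(row)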

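This is adapted from an existing script of mine, and it is better suited to extracting only a specific subset of the XML. If you need all the data, you are probably better off using some XML tree walker.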
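
As a minimal sketch of what such a tree walker could look like, the generator below visits every element and yields its path together with its attributes. It uses the standard library's xml.etree.ElementTree so that it runs without lxml; the function name walk, the path format and the tiny SAMPLE document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny, hypothetical document; a real Player file has the same nesting.
SAMPLE = (
    '<team><player>'
    '<player-metadata player-key="33201"><name last="Mustermann"/></player-metadata>'
    '</player></team>'
)


def walk(element, path=''):
    """Yield (path, attributes) for element and every descendant, depth first."""
    current = f'{path}/{element.tag}'
    yield current, dict(element.attrib)
    for child in element:
        yield from walk(child, current)


rows = list(walk(ET.fromstring(SAMPLE)))
for path, attributes in rows:
    print(path, attributes)
```

Because the walker does not need to know the structure in advance, the same function could be pointed at the Player, Team and Match files alike; filtering for the interesting paths then happens on the collected rows.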

file_get_contents is a small helper function:

def file_get_contents(filename):
    # read raw bytes so the XML parser can apply the declared encoding itself
    with open(filename, 'rb') as f:
        return f.read()

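XPath is a powerful language for finding nodes in XML. Note that, depending on the XPath you use, the result can be XML nodes, as in the "for player in ..." statement, or strings, as in the "row['player'] =" statement.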
