I have a list of linkId values:
links_o_i = [652518, 345004, 225317, 177396, 551734]
In addition, I have an XML file with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE facilities SYSTEM "http://www.matsim.org/files/dtd/facilities_v1.dtd">
<facilities name="Facilities from different sources">
<!-- ====================================================================== -->
<facility id="10002" linkId="666355" x="2684102.0" y="1253168.0">
<activity type="other">
</activity>
<activity type="work">
</activity>
</facility>
<!-- ====================================================================== -->
<facility id="10007" linkId="961312" x="2683486.0" y="1247853.0">
<activity type="other">
</activity>
<activity type="work">
</activity>
</facility>
<!-- ====================================================================== -->
<facility id="100070" linkId="652518" x="2684238.0" y="1246568.0">
<activity type="leisure">
</activity>
<activity type="other">
</activity>
<activity type="work">
</activity>
</facility>
<!-- ====================================================================== -->
<facility id="100071" linkId="1063278" x="2689220.0" y="1243493.0">
<activity type="leisure">
</activity>
<activity type="other">
</activity>
<activity type="work">
</activity>
</facility>
<!-- ====================================================================== -->
<facility id="100072" linkId="786540" x="2680812.0" y="1249375.0">
<activity type="leisure">
</activity>
<activity type="other">
</activity>
<activity type="work">
</activity>
</facility>
<!-- ====================================================================== -->
<facility id="100073" linkId="225317" x="2681506.0" y="1249508.0">
<activity type="other">
</activity>
<activity type="shop">
</activity>
<activity type="work">
</activity>
</facility>
</facilities>
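I want to parse the XML file and extract the x and y values of every facility whose linkId is in the list links_o_i.
The goal is a three-column DataFrame with the linkId, x and y values.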
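My approach so far yields no result, and I am struggling to find out why. Note that both the list and the XML file are much larger than shown here.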
import gzip
import xml.etree.ElementTree as ET
from collections import defaultdict
import pandas as pd

tree = ET.iterparse(gzip.open("file.xml.gz", 'r'))
link_coords = defaultdict(list)
for xml_event, elem in tree:
    attributes = elem.attrib
    if elem.tag == 'facility' \
            and elem.attrib["linkId"] in links_o_i:
        link_coords[attributes['linkId']].append[attributes['x', 'y']]
    elem.clear()
link_coords = pd.DataFrame.from_dict(link_coords)
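You can use xmltodict to parse the data into a dict and then extract the values you need:
import pandas as pd
import xmltodict

# data holds the content of the XML file
# keep only the linkId, x and y attributes of every facility
extract = [{k: v for k, v in ent.items() if k in ['@linkId', '@x', '@y']}
           for ent in xmltodict.parse(data)['facilities']['facility']]
# filter for only the entries whose linkId is in the list
res = [ent for ent in extract if int(ent['@linkId']) in links_o_i]
# read into a DataFrame
pd.DataFrame(res)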
  @linkId         @x         @y
0  652518  2684238.0  1246568.0
1  225317  2681506.0  1249508.0
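As for why the original loop finds nothing: XML attribute values are always strings, so elem.attrib["linkId"] (for example "652518") is never equal to any of the integers in links_o_i, and the condition is never true. Once it does match, .append[...] would also fail, because append has to be called with parentheses and attributes['x', 'y'] is not a valid lookup. Below is a minimal sketch of a streaming variant that stays with iterparse, which may suit a very large file better. The function name extract_link_coords and the cut-down sample document are assumptions made for illustration; they are not part of the original post.

```python
import io
import xml.etree.ElementTree as ET


def extract_link_coords(source, link_ids):
    """Return one dict (linkId, x, y) per facility whose linkId is wanted."""
    # Attribute values are strings, so compare against strings.
    # A set keeps the membership test fast for a long list.
    wanted = {str(i) for i in link_ids}
    rows = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "facility":
            if elem.get("linkId") in wanted:
                rows.append({
                    "linkId": int(elem.get("linkId")),
                    "x": float(elem.get("x")),
                    "y": float(elem.get("y")),
                })
            # Drop the element's content once it has been handled.
            elem.clear()
    return rows


# Shortened copy of the structure shown in the question (illustration only).
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<facilities name="Facilities from different sources">
<facility id="10002" linkId="666355" x="2684102.0" y="1253168.0">
<activity type="work"></activity>
</facility>
<facility id="100070" linkId="652518" x="2684238.0" y="1246568.0">
<activity type="leisure"></activity>
</facility>
<facility id="100073" linkId="225317" x="2681506.0" y="1249508.0">
<activity type="shop"></activity>
</facility>
</facilities>
"""

links_o_i = [652518, 345004, 225317, 177396, 551734]
rows = extract_link_coords(io.BytesIO(sample), links_o_i)
print(rows)
```

For the real file, passing gzip.open("file.xml.gz", "rb") as source should work in the same way, and pd.DataFrame(rows, columns=["linkId", "x", "y"]) then gives the three-column DataFrame.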