I have a 52 MB XML file that contains 115139 elements.
from lxml import etree
tree = etree.parse(file)
root = tree.getroot()
In [76]: len(root)
Out[76]: 115139
I have this function, which iterates over the elements of root and inserts each parsed element into a Pandas DataFrame.
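import datetime
import pandas as pd

def fnc_parse_xml(file, columns):
    start = datetime.datetime.now()
    df = pd.DataFrame(columns=columns)
    tree = etree.parse(file)
    root = tree.getroot()
    # Search prefix built from the document's default namespace.
    xmlns = './/{' + root.nsmap[None] + '}'
    for loc, e in enumerate(root):
        # Collect the text of each requested column for this element.
        tot = []
        for column in columns:
            tot.append(e.find(xmlns + column).text)
        # Write the values into the DataFrame one row at a time.
        df.at[loc, columns] = tot
    end = datetime.datetime.now()
    diff = end - start
    return df, diff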
This works, but it takes a long time. I have an i7 with 16 GB of RAM.
In [75]: diff.total_seconds()/60
Out[75]: 36.41769186666667
In [77]: len(df)
Out[77]: 115139
I'm pretty sure there is a better way to parse a 52 MB XML file into a Pandas DataFrame.
Here is an excerpt of the XML file...
<findToFileResponse xmlns="xmlapi_1.0">
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<txOctets>0</txOctets>
<inSpeed>10000000</inSpeed>
<outSpeed>10000000</outSpeed>
<time>1587080746395</time>
<seconds>931265</seconds>
<port>Port 3/1/6</port>
<ip>192.168.157.204</ip>
<name>RouterA</name>
</equipment.MediaIndependentStats>
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<txOctets>0</txOctets>
<inSpeed>100000</inSpeed>
<outSpeed>100000</outSpeed>
<time>1587080739924</time>
<seconds>928831</seconds>
<port>Port 1/1/1</port>
<ip>192.168.154.63</ip>
<name>RouterB</name>
</equipment.MediaIndependentStats>
</findToFileResponse>
Any ideas on how to improve the speed?
For the XML excerpt above, the function fnc_parse_xml(file, columns) returns this DataFrame:
In [83]: df
Out[83]:
   rxOctets  txOctets   inSpeed  outSpeed           time  seconds        port               ip     name
0         0         0  10000000  10000000  1587080746395   931265  Port 3/1/6  192.168.157.204  RouterA
1         0         0    100000    100000  1587080739924   928831  Port 1/1/1   192.168.154.63  RouterB
Another option, instead of building a tree by parsing the whole XML file, is to use iterparse...
import datetime
import pandas as pd
from lxml import etree

def fnc_parse_xml(file, columns):
    start = datetime.datetime.now()
    # Collect all rows in a list.
    rows = []
    # Process all "equipment.MediaIndependentStats" elements.
    for event, elem in etree.iterparse(file, tag="{xmlapi_1.0}equipment.MediaIndependentStats"):
        # Each row is a new dict.
        row = {}
        # Process all children of "equipment.MediaIndependentStats".
        for child in elem.xpath("./*"):
            # Use the element's local name (without namespace) as the key
            # and its text content as the value.
            row[etree.QName(child.tag).localname] = child.text
        # Append the row dict to the list of rows.
        rows.append(row)
    end = datetime.datetime.now()
    diff = end - start
    # Create the DataFrame on return. This would probably be better in a try/except to handle errors.
    return pd.DataFrame(rows, columns=columns), diff

print(fnc_parse_xml("input.xml",
                    ["rxOctets", "txOctets", "inSpeed", "outSpeed", "time", "seconds", "port", "ip", "name"]))
Also see here for more information on XPath in lxml.
On my machine, this processes a 92.5 MB file in 9 seconds.
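If memory use matters for larger files, the same streaming idea can be combined with clearing each element once it has been read. The following is only a minimal sketch. It swaps lxml for the standard library's xml.etree.ElementTree so that it runs without extra dependencies (lxml's iterparse additionally accepts the tag= filter used above). The helper name iter_rows and the trimmed sample data are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Namespace and row tag taken from the sample XML in the question.
NS = "{xmlapi_1.0}"
ROW_TAG = NS + "equipment.MediaIndependentStats"


def iter_rows(source):
    """Yield one dict per row element (hypothetical helper, not part of any library)."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != ROW_TAG:
            continue
        # Strip the "{namespace}" prefix from each child tag to get the column name.
        yield {child.tag.rpartition("}")[2]: child.text for child in elem}
        # Release the children that were just read; the emptied row element
        # itself stays attached to the root.
        elem.clear()


# Trimmed sample with three of the nine columns, for illustration only.
sample = b"""<findToFileResponse xmlns="xmlapi_1.0">
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<port>Port 3/1/6</port>
<name>RouterA</name>
</equipment.MediaIndependentStats>
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<port>Port 1/1/1</port>
<name>RouterB</name>
</equipment.MediaIndependentStats>
</findToFileResponse>"""

rows = list(iter_rows(io.BytesIO(sample)))
print(rows)
```

The resulting list of dicts can then be passed to pd.DataFrame(rows, columns=columns) in a single step, as in the function above.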
You declare an empty DataFrame, so specifying the index up front may speed things up. Otherwise, the DataFrame is continually expanded.
df = pd.DataFrame(index=range(0, len(root)))
You could also create the DataFrame once the loop has finished.
vals = [[e.find(xmlns + column).text for column in columns] for e in root]
df = pd.DataFrame(data=vals, columns=['rxOctets', ...])
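One thing to watch with this approach: e.find(...) returns None when a row lacks one of the columns, so .text would raise an AttributeError. A minimal sketch of a more forgiving variant uses findtext, which returns None for a missing child. It uses the standard library's xml.etree.ElementTree for self-containment (lxml elements offer the same method), and the sample data with a missing name element is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative data: the second row has no <name> element.
xml_text = """<findToFileResponse xmlns="xmlapi_1.0">
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<name>RouterA</name>
</equipment.MediaIndependentStats>
<equipment.MediaIndependentStats>
<rxOctets>5</rxOctets>
</equipment.MediaIndependentStats>
</findToFileResponse>"""

columns = ["rxOctets", "name"]
root = ET.fromstring(xml_text)
prefix = "{xmlapi_1.0}"  # namespace of the sample document

# findtext returns None (its default) when the child element is absent,
# so a missing column does not abort the whole parse.
vals = [[e.findtext(prefix + column) for column in columns] for e in root]
print(vals)
```

Passing vals to pd.DataFrame(vals, columns=columns) should then leave a missing value in the affected cell instead of failing.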
We will use the xmltodict library, which lets you treat an XML document like a dict/JSON. The data you are interested in is embedded under the 'equipment.MediaIndependentStats' key:
import pandas as pd
import xmltodict
data = """<findToFileResponse xmlns="xmlapi_1.0">
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<txOctets>0</txOctets>
<inSpeed>10000000</inSpeed>
<outSpeed>10000000</outSpeed>
<time>1587080746395</time>
<seconds>931265</seconds>
<port>Port 3/1/6</port>
<ip>192.168.157.204</ip>
<name>RouterA</name>
</equipment.MediaIndependentStats>
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<txOctets>0</txOctets>
<inSpeed>100000</inSpeed>
<outSpeed>100000</outSpeed>
<time>1587080739924</time>
<seconds>928831</seconds>
<port>Port 1/1/1</port>
<ip>192.168.154.63</ip>
<name>RouterB</name>
</equipment.MediaIndependentStats>
</findToFileResponse>"""
pd.concat(pd.DataFrame.from_dict(ent, orient='index').T
          for ent in xmltodict.parse(data)['findToFileResponse']['equipment.MediaIndependentStats'])
   rxOctets  txOctets   inSpeed  outSpeed           time  seconds        port               ip     name
0         0         0  10000000  10000000  1587080746395   931265  Port 3/1/6  192.168.157.204  RouterA
0         0         0    100000    100000  1587080739924   928831  Port 1/1/1   192.168.154.63  RouterB
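The per-record pd.concat above can probably be simplified. xmltodict yields a list of dict-like records under that key, and pd.DataFrame accepts such a list directly, which also gives a running index instead of repeated zeros. The sketch below assumes a trimmed version of the sample data. It also passes force_list to xmltodict.parse so the key stays a list even when the file contains only one record.

```python
import pandas as pd
import xmltodict

# Trimmed sample with three of the nine columns, for illustration only.
data = """<findToFileResponse xmlns="xmlapi_1.0">
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<port>Port 3/1/6</port>
<name>RouterA</name>
</equipment.MediaIndependentStats>
<equipment.MediaIndependentStats>
<rxOctets>0</rxOctets>
<port>Port 1/1/1</port>
<name>RouterB</name>
</equipment.MediaIndependentStats>
</findToFileResponse>"""

key = "equipment.MediaIndependentStats"
# force_list keeps the value under `key` a list even for a single record.
parsed = xmltodict.parse(data, force_list=(key,))
records = parsed["findToFileResponse"][key]

# Build the DataFrame once from the list of records.
df = pd.DataFrame(records)
print(df)
```

All values arrive as strings, so numeric columns may still need converting afterwards.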