I have an XML file with the following structure.
<population desc="Switzerland Baseline">
<attributes>
<attribute name="coordinateReferenceSystem" class="java.lang.String" >Atlantis</attribute>
</attributes>
<person id="1015600">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="10002042">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="1241567">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="1218895">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
<person id="10002042">
<attributes>
<attribute name="age" class="java.lang.Integer" >86</attribute>
<attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
<attribute name="sex" class="java.lang.String" >f</attribute>
</attributes>
</person>
</population>
I also have a Pandas dataframe called agents that holds the relevant ids:
id
0 1015600
1 1218895
2 1241567
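What I want is to go through the large XML file and extract ptSubscription for every person whose id is in agents.
The desired output is a dataframe or a list containing each id and its value.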
id ptSubscription
0 1015600 false
1 1218895 true
2 1241567 true
My approach returns an empty output.
import gzip
import xml.etree.cElementTree as ET
import pandas as pd
from collections import defaultdict

file = 'output_plans.xml.gz'
data = gzip.open(file, 'r')
root = ET.parse(data).getroot()

rows = []
for it in root.iter('person'):
    if it.attrib['id'] in agents[["id"]]:
        id = it.attrib['id']
        age = it.find('attributes/attribute[@name="ptSubscription"]').text
        rows.append([id, age])
    #root.clear()

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
pt
A general way to extract the requested information with lxml would be
import pandas as pd
from lxml import etree

# parse the whole document once
tree = etree.parse("sample.xml")

# absolute XPath, with a placeholder for the person id
xpath_fmt = '/population/person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'
agents = [1015600, 1218895, 1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    # one query per id; every matching attribute contributes a row
    for res in tree.xpath(xpath):
        rows.append([pid, res.text])

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
print(pt)
Using only the standard library, the code can be rewritten as
import pandas as pd
# cElementTree was removed in Python 3.9; ElementTree is the supported module
import xml.etree.ElementTree as ET

root = ET.parse("sample.xml").getroot()

# relative path: root is already the <population> element
xpath_fmt = 'person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'
agents = [1015600, 1218895, 1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    for res in root.findall(xpath):
        rows.append([pid, res.text])

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
print(pt)
Here the XPath has to be relative to the population element, which is why it no longer starts with /population.
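As for why the loop in the question came back empty: `value in some_dataframe` tests the column labels rather than the cell values, and the ids read from the XML are strings while the ids in agents appear to be integers, so the membership test never succeeds. A minimal single-pass sketch with the standard library follows. The inline SAMPLE document and the agent_ids list are stand-ins for the real file and for agents["id"], not something taken from the original post.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical inline sample standing in for the real (gzipped) file.
SAMPLE = """<population desc="Switzerland Baseline">
  <person id="1015600"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
  </attributes></person>
  <person id="10002042"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
  </attributes></person>
  <person id="1241567"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
  </attributes></person>
  <person id="1218895"><attributes>
    <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
  </attributes></person>
</population>"""

# Stand-in for the ids held in the dataframe; with pandas this set could be
# built as set(agents["id"].astype(str)).
agent_ids = [1015600, 1218895, 1241567]

# XML attribute values are strings, so compare against a set of strings.
wanted = {str(i) for i in agent_ids}

root = ET.parse(io.StringIO(SAMPLE)).getroot()

rows = []
for person in root.iter("person"):
    pid = person.get("id")
    if pid in wanted:
        node = person.find('attributes/attribute[@name="ptSubscription"]')
        if node is not None:  # skip persons without that attribute
            rows.append([pid, node.text])

print(rows)
```

This walks the tree once and does a set lookup per person, instead of running one XPath query per id. The rows come out in document order and can be passed to pd.DataFrame(rows, columns=['id', 'PTSubscription']) as in the snippets above.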
We can use parsel to pull out the details.
# read in the data
with open("test.xml") as fd:
    tree = fd.read()

# import the libraries and parse the XML
import pandas as pd
from parsel import Selector

selector = Selector(text=tree, type='xml')

# ids to look for
agents = ['1015600', '1218895', '1241567']

# select every person whose id is in agents; padding both sides with spaces
# makes contains() match whole ids rather than substrings of longer ids
id_list = ' '.join(agents)
ids = selector.xpath(f".//person[contains(' {id_list} ', concat(' ', @id, ' '))]")

# pair each id with the text of its ptSubscription attribute
d = {}
for ent in ids:
    vals = ent.xpath(".//attribute[@name='ptSubscription']/text()").get()
    key = ent.xpath("./@id").get()
    d[key] = vals

print(d)
# {'1015600': 'false', '1241567': 'true', '1218895': 'true'}

# put into a dataframe
pd.DataFrame.from_dict(d, orient='index', columns=['PTSubscription'])
# the same idea with elementpath, which evaluates XPath 2.0 on ElementTree
import xml.etree.ElementTree as ET
import elementpath

root = ET.parse("test.xml").getroot()
agents = ('1015600', '1218895', '1241567')

# the tuple is rendered as an XPath 2.0 sequence, so @id is compared against all ids
id_path = f".//person[@id={agents}]"
subscription_path = ".//attribute[@name='ptSubscription']/text()"

d = {}
for entry in elementpath.select(root, id_path):
    key = elementpath.select(entry, "./@id")[0]
    val = elementpath.select(entry, subscription_path)[0]
    d[key] = val
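Since the question mentions a large gzipped file, it may also be worth streaming the document rather than loading it whole. The sketch below uses xml.etree.ElementTree.iterparse from the standard library. The function name extract_pt_subscriptions and the small in-memory gzip payload are assumptions made for illustration; in practice the file object would come from gzip.open('output_plans.xml.gz', 'rb').

```python
import gzip
import io
import xml.etree.ElementTree as ET


def extract_pt_subscriptions(source, wanted_ids):
    """Stream <person> elements from a binary file object and collect
    (id, ptSubscription) pairs for the wanted ids. Hypothetical helper."""
    wanted = {str(i) for i in wanted_ids}  # XML ids are strings
    rows = []
    # "end" events fire once an element and all of its children are parsed
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "person":
            continue
        pid = elem.get("id")
        if pid in wanted:
            node = elem.find('attributes/attribute[@name="ptSubscription"]')
            if node is not None:
                rows.append((pid, node.text))
        # drop the person's children and attributes to keep memory low;
        # the emptied element itself stays attached to the root
        elem.clear()
    return rows


# Small in-memory stand-in for the real .xml.gz file.
xml_bytes = b"""<population>
  <person id="1015600"><attributes>
    <attribute name="ptSubscription">false</attribute>
  </attributes></person>
  <person id="10002042"><attributes>
    <attribute name="ptSubscription">false</attribute>
  </attributes></person>
  <person id="1241567"><attributes>
    <attribute name="ptSubscription">true</attribute>
  </attributes></person>
  <person id="1218895"><attributes>
    <attribute name="ptSubscription">true</attribute>
  </attributes></person>
</population>"""

with gzip.open(io.BytesIO(gzip.compress(xml_bytes)), "rb") as fh:
    result = extract_pt_subscriptions(fh, [1015600, 1218895, 1241567])

print(result)
```

The trade-off is that iterparse gives up XPath queries over the whole tree in exchange for processing one person at a time, which is usually the better fit when only a handful of ids are needed from a very large population file.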