How to extract child values based on whether the parent's value appears in a list, using Python?

Question

I have an XML file with the following structure.

<population desc="Switzerland Baseline">
    <attributes>
        <attribute name="coordinateReferenceSystem" class="java.lang.String" >Atlantis</attribute>
    </attributes>

    <person id="1015600">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
    <person id="10002042">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >false</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
    <person id="1241567">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>   
    <person id="1218895">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>   
    <person id="10002042">
        <attributes>
            <attribute name="age" class="java.lang.Integer" >86</attribute>
            <attribute name="ptSubscription" class="java.lang.Boolean" >true</attribute>
            <attribute name="sex" class="java.lang.String" >f</attribute>
        </attributes>
    </person>
</population>

I also have a pandas DataFrame called agents that holds the relevant ids.

    id
0   1015600
1   1218895
2   1241567

What I want is to go through the large XML and extract the ptSubscription value for each person whose id is in agents.

The desired output is a DataFrame or a list containing the id and the value.

    id          ptSubscription
0   1015600     false
1   1218895     true
2   1241567     true

My approach returns an empty output.

import gzip
import xml.etree.cElementTree as ET
import pandas as pd
from collections import defaultdict

file = 'output_plans.xml.gz'
data = gzip.open(file, 'r')
root = ET.parse(data).getroot()

rows = []
for it in root.iter('person'):
    if it.attrib['id'] in agents[["id"]]:
        id = it.attrib['id']
        age = it.find('attributes/attribute[@name="ptSubscription"]').text
        rows.append([id, age])
#root.clear()

pt = pd.DataFrame(rows, columns=['id', 'PTSubscription'])
pt

Answer 1

A general approach that uses lxml to extract the requested information would be

import pandas as pd
from lxml import etree

with open("sample.xml") as fd:
    tree = etree.parse(fd)

xpath_fmt = '/population/person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'


agents = [1015600,1218895,1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = tree.xpath(xpath)
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])

Using only the standard library, the code would be rewritten as

import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9
import pandas as pd

with open("sample.xml") as fd:
    root = ET.parse(fd).getroot()

xpath_fmt = 'person[@id="{}"]/attributes/attribute[@name="ptSubscription"]'


agents = [1015600, 1218895, 1241567]

rows = []
for pid in agents:
    xpath = xpath_fmt.format(pid)
    r = root.findall(xpath)
    for res in r:
        rows.append([pid, res.text])

pd.DataFrame(rows, columns=['id', 'PTSubscription'])

The XPath is relative here because root is already the population element.
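As for why the loop in the question returns nothing: `x in agents[["id"]]` tests the DataFrame's column labels rather than its values, and the id attributes read from the XML are strings while the id column appears to hold integers. Below is a minimal single-pass sketch of a fix, assuming the ids are integers. `XML_SAMPLE`, `agent_ids` and `wanted` are made-up names, and the inline sample stands in for the real file.

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the real file (illustrative data only).
XML_SAMPLE = """<population>
    <person id="1015600">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="10002042">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="1218895">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
        </attributes>
    </person>
</population>"""

# With the DataFrame from the question this would be
# set(agents["id"].astype(str)); a plain list is used here instead.
agent_ids = [1015600, 1218895, 1241567]
wanted = {str(i) for i in agent_ids}  # XML attribute values are strings

root = ET.fromstring(XML_SAMPLE)
rows = []
for person in root.iter("person"):
    pid = person.get("id")
    if pid in wanted:  # set membership on values, not on column labels
        node = person.find('attributes/attribute[@name="ptSubscription"]')
        if node is not None:
            rows.append([pid, node.text])

print(rows)
```

This walks the tree once instead of running one XPath query per id, which may matter when the list of agents is long. The resulting rows can be passed to pd.DataFrame(rows, columns=['id', 'PTSubscription']) as in the code above.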

Answer 2

We can use parsel to pull out the details.

# read in the data:

with open("test.xml") as fd:
    tree = fd.read()

# import the libraries and parse the XML:
import pandas as pd
from parsel import Selector

selector = Selector(text=tree, type='xml')

#checklist : 
agents = ['1015600','1218895','1241567']

#track the ids
#this checks and selects ids in agents
ids = selector.xpath(f".//person[contains({' '.join(agents)!r},@id)]")

#pair ids with attribute where the name == ptSubscription : 

d = {}
for ent in ids:
    vals = ent.xpath(".//attribute[@name='ptSubscription']/text()").get()
    key = ent.xpath(".//@id").get()
    d[key] = vals

print(d)

{'1015600': 'false', '1241567': 'true', '1218895': 'true'}

#put into a dataframe : 
pd.DataFrame.from_dict(d,orient='index', columns=['PTSubscription'])

Alternative: use Python's built-in ElementTree together with elementpath:

import xml.etree.ElementTree as ET
import elementpath
root = ET.parse("test.xml").getroot()

agents = ('1015600','1218895','1241567')

id_path = f".//person[@id={agents}]"
subscription_path = ".//attribute[@name='ptSubscription']/text()"

d = {}
for entry in elementpath.select(root, id_path):
    key = elementpath.select(entry,"./@id")[0]
    val = elementpath.select(entry,subscription_path)[0]
    d[key] = val
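Since the question mentions a large gzipped file, a streaming variant may be worth considering. The sketch below uses the standard library's `ET.iterparse` and clears each person element once it has been handled. The in-memory `compressed` buffer is only a stand-in for the real output_plans.xml.gz, and `XML_SAMPLE` and `wanted` are made-up names.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Illustrative data only; the real input would be the .xml.gz file.
XML_SAMPLE = b"""<population desc="Switzerland Baseline">
    <attributes>
        <attribute name="coordinateReferenceSystem" class="java.lang.String">Atlantis</attribute>
    </attributes>
    <person id="1015600">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="10002042">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">false</attribute>
        </attributes>
    </person>
    <person id="1241567">
        <attributes>
            <attribute name="ptSubscription" class="java.lang.Boolean">true</attribute>
        </attributes>
    </person>
</population>"""

# Stand-in for gzip.open('output_plans.xml.gz', 'rb').
compressed = io.BytesIO(gzip.compress(XML_SAMPLE))

wanted = {"1015600", "1218895", "1241567"}

rows = []
with gzip.open(compressed, "rb") as fh:
    # "end" events fire once an element and all its children are parsed.
    for event, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag != "person":
            continue
        pid = elem.get("id")
        if pid in wanted:
            node = elem.find('attributes/attribute[@name="ptSubscription"]')
            if node is not None:
                rows.append([pid, node.text])
        # Drop the finished person's children to keep memory use low.
        elem.clear()

print(rows)
```

The emptied person elements remain attached to the root, so memory still grows slowly with the number of persons. Whether that matters depends on the size of the file.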
