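I have a pandas DataFrame that contains hierarchical elements from a database. The parent_id column tells me which parent each child belongs to. I want to generate a nested JSON from this DataFrame (see section 3 for the desired format). I have tried many things, but I am mostly stuck on the _children list field of each parent, which should contain all of its children. Any help is much appreciated, thanks!
1. pandas DataFrame (what I have)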
parent_id child_id level
125582 214659 1
125582 214633 1
125582 214263 1
214263 131673 2
214263 125579 2
214263 125578 2
214263 172670 2
214263 214266 2
214266 216675 3
214266 216671 3
172670 172669 3
172670 174777 3
172670 207661 3
207661 216529 4
207661 223884 4
223884 223885 5
223885 229186 6
229186 219062 7
229186 222243 7
and so on...
2. DataFrame logic
125582 (level 0 - root)
--214659 (level 1)
--214633 (level 1)
--214263 (level 1)
----131673 (level 2)
----125579 (level 2)
----125578 (level 2)
----172670 (level 2)
------172669 (level 3)
------174777 (level 3)
------207661 (level 3)
--------216529 (level 4)
--------223884 (level 4)
----------223885 (level 5)
------------229186 (level 6)
--------------219062 (level 7)
--------------222243 (level 7)
----214266 (level 2)
------216675 (level 3)
------216671 (level 3)
and so on...
3. JSON (what I need)
[
{id: 125582, level: 0, "_children": [
{id: 214659, level: 1},
{id: 214633, level: 1},
{id: 214263, level: 1, "_children": [
{id: 131673, level: 2},
{id: 125579, level: 2},
{id: 125578, level: 2},
{id: 172670, level: 2, "_children": [
{id: 172669, level: 3},
{id: 174777, level: 3},
{id: 207661, level: 3, "_children": [
{id: 216529, level: 4},
{id: 223884, level: 4, "_children": [
{id: 223885, level: 5, "_children": [
{id: 229186, level: 6, "_children": [
{id: 219062, level: 7},
{id: 222243, level: 7}
]}
]}
]}
]}
]},
{id: 214266, level: 2, "_children": [
{id: 216675, level: 3},
{id: 216671, level: 3}
]}
]},
and so on...
]}
]
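This hierarchical JSON encodes a parent -> child relationship, so I suggest using the networkx package to construct a graph g from the DataFrame and then creating the desired JSON from g.
You need to install networkx with pip first, following its installation instructions.
After that, do the following steps:
- Create the directed graph g using from_pandas_edgelist
- Create a dictionary d with unique keys drawn from parent_id and child_id, whose values are level
- Assign d to the node attribute level of graph g
- Call nx.tree_data on g
The code is as follows: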
import networkx as nx
# create the directed graph `g` using `from_pandas_edgelist`
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
                            create_using=nx.DiGraph())
# create a dictionary with unique keys on `parent_id` and `child_id`
df1 = df.melt('level')
d = dict(zip(df1.value, df1.level))
root_id = df.loc[0, 'parent_id']  # assumes the first row's parent is the root
d[root_id] = 0  # set level of root to `0`
# assign `d` to the node attribute `level` of graph `g`
nx.set_node_attributes(g, d, 'level')
# networkx >= 2.6 takes keyword arguments; older releases used
# nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})
out = [nx.tree_data(g, root_id, ident='id', children='_children')]
Out[119]:
[{'level': 0,
'id': 125582,
'_children': [{'level': 1, 'id': 214659},
{'level': 1, 'id': 214633},
{'level': 1,
'id': 214263,
'_children': [{'level': 2, 'id': 131673},
{'level': 2, 'id': 125579},
{'level': 2, 'id': 125578},
{'level': 2,
'id': 172670,
'_children': [{'level': 3, 'id': 172669},
{'level': 3, 'id': 174777},
{'level': 3,
'id': 207661,
'_children': [{'level': 4, 'id': 216529},
{'level': 4,
'id': 223884,
'_children': [{'level': 5,
'id': 223885,
'_children': [{'level': 6,
'id': 229186,
'_children': [{'level': 7, 'id': 219062},
{'level': 7, 'id': 222243}]}]}]}]}]},
{'level': 2,
'id': 214266,
'_children': [{'level': 3, 'id': 216675},
{'level': 3, 'id': 216671}]}]}]}]
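Note: the output above is a dictionary inside a list, as you wanted. My solution's output puts the key level before the key id, but the hierarchy/structure of the nested dictionaries is the same as the desired output.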
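The result above is a Python list of dicts rather than a JSON string. Here is a minimal sketch of the serialization step, assuming `out` has the shape shown above (the small `out` below is a hand-written stand-in). The helper name `to_builtin` is hypothetical; it is there because values that come from pandas can be NumPy integers, which `json.dumps` rejects by default.

```python
import json

# Hand-written stand-in for `out`, same shape as the result above.
out = [{'level': 0, 'id': 125582,
        '_children': [{'level': 1, 'id': 214659},
                      {'level': 1, 'id': 214263,
                       '_children': [{'level': 2, 'id': 131673}]}]}]

def to_builtin(obj):
    """Fallback for json.dumps: unwrap NumPy-style scalars via .item()."""
    if hasattr(obj, 'item'):
        return obj.item()
    raise TypeError(f'{type(obj).__name__} is not JSON serializable')

json_text = json.dumps(out, default=to_builtin, indent=2)

# Round-trip to confirm nothing was lost; dict equality ignores key order,
# so `level` coming before `id` makes no difference to consumers.
restored = json.loads(json_text)
```

Since JSON objects are unordered, the difference in key order mentioned in the note should not matter to whatever reads the file.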
On the other hand, if you don't need the level values in the output, just use the built-in nx.tree_data function directly, as follows:
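import networkx as nx
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', ['level'],
                            create_using=nx.DiGraph())
root_id = df.loc[0, 'parent_id']
# networkx >= 2.6 takes keyword arguments; older releases used
# nx.tree_data(g, root_id, {'id': 'id', 'children': '_children'})
out = [nx.tree_data(g, root_id, ident='id', children='_children')]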
Out[168]:
[{'id': 125582,
'_children': [{'id': 214659},
{'id': 214633},
{'id': 214263,
'_children': [{'id': 131673},
{'id': 125579},
{'id': 125578},
{'id': 172670,
'_children': [{'id': 172669},
{'id': 174777},
{'id': 207661,
'_children': [{'id': 216529},
{'id': 223884,
'_children': [{'id': 223885,
'_children': [{'id': 229186,
'_children': [{'id': 219062}, {'id': 222243}]}]}]}]}]},
{'id': 214266, '_children': [{'id': 216675}, {'id': 216671}]}]}]}]
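Update:
A modified version of nx.tree_data overcomes the restriction that each child has a single parent. Add this custom function to your code and call it instead of nx.tree_data: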
from itertools import chain
def tree_data_custom(G, root):
    id_ = 'id'
    children = '_children'
    def add_children(n, G):
        nbrs = G[n]
        if len(nbrs) == 0:
            return []
        children_ = []
        for child in nbrs:
            d = dict(chain(G.nodes[child].items(), [(id_, child)]))
            c = add_children(child, G)
            if c:
                d[children] = c
            children_.append(d)
        return children_
    data = dict(chain(G.nodes[root].items(), [(id_, root)]))
    data[children] = add_children(root, G)
    return data
Instead of calling nx.tree_data, just call tree_data_custom as follows:
out = [tree_data_custom(g, root_id)]
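To decide whether the custom function is needed at all, it can help to check whether any child sits under more than one distinct parent. The sketch below uses only the standard library; the edge list is a small illustrative sample, standing in for what `df[['parent_id', 'child_id']].itertuples(index=False)` would yield.

```python
from collections import defaultdict

# (parent_id, child_id) pairs; illustrative sample rows.
edges = [
    (222243, 143580), (222243, 182323), (222243, 61070),
    (61070, 46276), (61070, 46276),      # duplicate row, same parent
    (143580, 159911), (143580, 38958),
    (182323, 159911),                    # 159911 has a second parent
]

# Collect the set of distinct parents for every child.
parents_of = defaultdict(set)
for parent, child in edges:
    parents_of[child].add(parent)

# Keep only the children that have more than one distinct parent.
multi_parent = {c: sorted(p) for c, p in parents_of.items() if len(p) > 1}
print(multi_parent)  # {159911: [143580, 182323]}
```

Repeated rows with the same parent do not count here, and they also collapse into a single edge in a DiGraph. A child with several parents simply appears once under each of them in the nested output.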
Update 2: adding a name column to the DataFrame
Sample df, in which some children have multiple parents:
Out[258]:
parent_id child_id level name
0 125582 214659 1 a1
1 125582 214633 1 a1
2 125582 214263 1 a1
3 214263 131673 2 a2
4 214263 125579 2 a2
5 214263 125578 2 a2
6 214263 172670 2 a2
7 214263 214266 2 a2
8 214266 216675 3 a3
9 214266 216671 3 a3
10 172670 172669 3 a3
11 172670 174777 3 a3
12 172670 207661 3 a3
13 207661 216529 4 a4
14 207661 223884 4 a4
15 223884 223885 5 a5
16 223885 229186 6 a6
17 229186 219062 7 a7
18 229186 222243 7 a7
19 222243 219187 8 a8
20 222243 245985 8 a8
21 222243 232393 8 a8
22 222243 247138 8 a8
23 222243 228848 8 a8
24 222243 228848 8 a8
25 222243 233920 8 a8
26 222243 233920 8 a8
27 222243 228113 8 a8
28 222243 233767 8 a8
29 222243 235407 8 a8
30 222243 237757 8 a8
31 222243 159091 8 a8
32 222243 159091 8 a8
33 222243 214832 8 a8
34 222243 253990 8 a8
35 222243 231610 8 a8
36 222243 231610 8 a8
37 222243 182323 8 a8
38 222243 242190 8 a8
39 222243 143580 8 a8
40 222243 242188 8 a8
41 222243 143581 8 a8
42 222243 242187 8 a8
43 222243 143582 8 a8
44 222243 242189 8 a8
45 222243 205877 8 a8
46 222243 242823 8 a8
47 222243 140979 8 a8
48 222243 237824 8 a8
49 222243 149933 8 a8
50 222243 149933 8 a8
51 222243 153625 8 a8
52 222243 8392 8 a8
53 222243 162085 8 a8
54 222243 162085 8 a8
55 222243 150691 8 a8
56 222243 147773 8 a8
57 222243 147773 8 a8
58 222243 61070 8 a8
59 222243 61070 8 a8
60 222243 204850 8 a8
61 222243 204850 8 a8
62 61070 46276 9 a9
63 61070 46276 9 a9
64 61070 46276 9 a9
65 61070 46276 9 a9
66 143580 159911 9 a9
67 143580 38958 9 a9
68 182323 159911 9 a9
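The change is minimal: you only need to modify the step that creates the dictionary for the node attributes of g. Currently, we only add the attribute level to g. Now we need to create another dictionary for name and add it to g as the attribute name.
# create the directed graph `g` using `from_pandas_edgelist`.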
#You don't need `[level]` in this step
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
#create a dictionary with unique key on `parent_id` and `child_id`
#`melt` keep 2 columns 'level', 'name' instead of one column 'level'
#dictionary `d_level` for attribute `level` of `g's` nodes
#dictionary `d_name` for attribute `name` of `g's` nodes
df1 = df.melt(['level', 'name'])
d_level = dict(zip(df1.value, df1.level))
d_name = dict(zip(df1.value, df1.name))
root_id = df.loc[0, 'parent_id']
d_level[root_id] = 0 #set `level` of root to `0`
#assign `d_level` to nodes attributes `level` of graph `g` and `d_name` for `name`
nx.set_node_attributes(g, d_level, 'level') #add `level` values to nodes of `g`
nx.set_node_attributes(g, d_name, 'name') #add `name` values to nodes of `g`
#use customize `tree_data_custom` defined previously
out = [tree_data_custom(g, root_id)]
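If you add more columns to each row, it is better to create a single dictionary for all columns and apply it to the nodes of g in one call, as follows: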
g = nx.from_pandas_edgelist(df, 'parent_id', 'child_id', create_using=nx.DiGraph())
df1 = df.melt(['level', 'name'])
#this single dictionary to create both `level` and `name` attributes of nodes of `g`
d = {v: {'level': l, 'name': n} for v,l,n in zip(df1.value, df1.level, df1.name)}
root_id = df.loc[0, 'parent_id']
d[root_id]['level'] = 0 #set level of root to `0`
nx.set_node_attributes(g, d) #a single add for both attributes `level`, `name`
out = [tree_data_custom(g, root_id)]
Output
Out[260]:
[{'level': 0,
'name': 'a1',
'id': 125582,
'_children': [{'level': 1, 'name': 'a1', 'id': 214659},
{'level': 1, 'name': 'a1', 'id': 214633},
{'level': 1,
'name': 'a1',
'id': 214263,
'_children': [{'level': 2, 'name': 'a2', 'id': 131673},
{'level': 2, 'name': 'a2', 'id': 125579},
{'level': 2, 'name': 'a2', 'id': 125578},
{'level': 2,
'name': 'a2',
'id': 172670,
'_children': [{'level': 3, 'name': 'a3', 'id': 172669},
{'level': 3, 'name': 'a3', 'id': 174777},
{'level': 3,
'name': 'a3',
'id': 207661,
'_children': [{'level': 4, 'name': 'a4', 'id': 216529},
{'level': 4,
'name': 'a4',
'id': 223884,
'_children': [{'level': 5,
'name': 'a5',
'id': 223885,
'_children': [{'level': 6,
'name': 'a6',
'id': 229186,
'_children': [{'level': 7, 'name': 'a7', 'id': 219062},
{'level': 7,
'name': 'a7',
'id': 222243,
'_children': [{'level': 8, 'name': 'a8', 'id': 219187},
{'level': 8, 'name': 'a8', 'id': 245985},
{'level': 8, 'name': 'a8', 'id': 232393},
{'level': 8, 'name': 'a8', 'id': 247138},
{'level': 8, 'name': 'a8', 'id': 228848},
{'level': 8, 'name': 'a8', 'id': 233920},
{'level': 8, 'name': 'a8', 'id': 228113},
{'level': 8, 'name': 'a8', 'id': 233767},
{'level': 8, 'name': 'a8', 'id': 235407},
{'level': 8, 'name': 'a8', 'id': 237757},
{'level': 8, 'name': 'a8', 'id': 159091},
{'level': 8, 'name': 'a8', 'id': 214832},
{'level': 8, 'name': 'a8', 'id': 253990},
{'level': 8, 'name': 'a8', 'id': 231610},
{'level': 8,
'name': 'a8',
'id': 182323,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911}]},
{'level': 8, 'name': 'a8', 'id': 242190},
{'level': 8,
'name': 'a8',
'id': 143580,
'_children': [{'level': 9, 'name': 'a9', 'id': 159911},
{'level': 9, 'name': 'a9', 'id': 38958}]},
{'level': 8, 'name': 'a8', 'id': 242188},
{'level': 8, 'name': 'a8', 'id': 143581},
{'level': 8, 'name': 'a8', 'id': 242187},
{'level': 8, 'name': 'a8', 'id': 143582},
{'level': 8, 'name': 'a8', 'id': 242189},
{'level': 8, 'name': 'a8', 'id': 205877},
{'level': 8, 'name': 'a8', 'id': 242823},
{'level': 8, 'name': 'a8', 'id': 140979},
{'level': 8, 'name': 'a8', 'id': 237824},
{'level': 8, 'name': 'a8', 'id': 149933},
{'level': 8, 'name': 'a8', 'id': 153625},
{'level': 8, 'name': 'a8', 'id': 8392},
{'level': 8, 'name': 'a8', 'id': 162085},
{'level': 8, 'name': 'a8', 'id': 150691},
{'level': 8, 'name': 'a8', 'id': 147773},
{'level': 8,
'name': 'a8',
'id': 61070,
'_children': [{'level': 9, 'name': 'a9', 'id': 46276}]},
{'level': 8, 'name': 'a8', 'id': 204850}]}]}]}]}]}]},
{'level': 2,
'name': 'a2',
'id': 214266,
'_children': [{'level': 3, 'name': 'a3', 'id': 216675},
{'level': 3, 'name': 'a3', 'id': 216671}]}]}]}]
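The single-dictionary idea above can be generalized so that new columns are picked up without editing the comprehension. This is only a sketch under assumptions: the rows are plain dicts (what `df.to_dict('records')` would return), the ids are made up, and `node_attributes` and the `color` column are hypothetical names used for illustration.

```python
# Rows as plain dicts; the `color` column is a hypothetical extra column.
rows = [
    {'parent_id': 1, 'child_id': 2, 'level': 1, 'name': 'a1', 'color': 'red'},
    {'parent_id': 1, 'child_id': 3, 'level': 1, 'name': 'a1', 'color': 'blue'},
    {'parent_id': 3, 'child_id': 4, 'level': 2, 'name': 'a2', 'color': 'green'},
]

def node_attributes(rows, root_id, source='parent_id', target='child_id'):
    """Build {node: {column: value}} from every non-key column."""
    attrs = {}
    # Parents first, children second, so the row in which a node is the
    # child wins, mirroring the order of the melt-based dictionary.
    for key in (source, target):
        for row in rows:
            attrs[row[key]] = {k: v for k, v in row.items()
                               if k not in (source, target)}
    attrs[root_id]['level'] = 0  # the root never appears as a child
    return attrs

d = node_attributes(rows, root_id=1)
```

The resulting dict of dicts has the shape that `nx.set_node_attributes(g, d)` accepts, so it could replace the hand-written comprehension.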
This isn't a pandas one-liner, but here is a recursive function that should work.
import json
import pandas as pd
df = pd.DataFrame(
[{"parent_id": 125582, "child_id": 214659, "level": 1}, {"parent_id": 125582, "child_id": 214633, "level": 1},
{"parent_id": 125582, "child_id": 214263, "level": 1}, {"parent_id": 214263, "child_id": 131673, "level": 2},
{"parent_id": 214263, "child_id": 125579, "level": 2}, {"parent_id": 214263, "child_id": 125578, "level": 2},
{"parent_id": 214263, "child_id": 172670, "level": 2}, {"parent_id": 214263, "child_id": 214266, "level": 2},
{"parent_id": 214266, "child_id": 216675, "level": 3}, {"parent_id": 214266, "child_id": 216671, "level": 3},
{"parent_id": 172670, "child_id": 172669, "level": 3}, {"parent_id": 172670, "child_id": 174777, "level": 3},
{"parent_id": 172670, "child_id": 207661, "level": 3}, {"parent_id": 207661, "child_id": 216529, "level": 4},
{"parent_id": 207661, "child_id": 223884, "level": 4}, {"parent_id": 223884, "child_id": 223885, "level": 5},
{"parent_id": 223885, "child_id": 229186, "level": 6}, {"parent_id": 229186, "child_id": 219062, "level": 7},
{"parent_id": 229186, "child_id": 222243, "level": 7}]
)
def recurse(parent_id, level):
    # create the base result
    result = {"id": int(parent_id), "level": int(level)}
    # get all of the children of this parent, one level below this one
    children = df[(df.parent_id == parent_id) & (df["level"] == level + 1)]
    # if there are no such children, then return without a _children key
    if children.empty:
        return result
    # otherwise, recurse on each child_id
    result["_children"] = [recurse(child_id, level + 1) for child_id in sorted(children.child_id.unique())]
    return result
# start at an implicit level 0
tree = [recurse(parent_id, 0) for parent_id in sorted(df[df["level"] == 1].parent_id.unique())]
# convert to JSON
print(json.dumps(tree, indent=3, sort_keys=True))
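The recursive function above filters the whole DataFrame on every call. A possible variant, sketched here with plain tuples instead of a DataFrame (for example from `df.itertuples(index=False)`), groups the children by parent and level once and then recurses over that lookup. The names `children_of` and `build` are illustrative.

```python
import json
from collections import defaultdict

# (parent_id, child_id, level) rows; a subset of the question's data.
rows = [
    (125582, 214659, 1), (125582, 214263, 1),
    (214263, 131673, 2), (214263, 214266, 2),
    (214266, 216675, 3), (214266, 216671, 3),
]

# Group child ids by (parent_id, parent level) in a single pass.
children_of = defaultdict(set)
for parent_id, child_id, level in rows:
    children_of[(parent_id, level - 1)].add(child_id)

def build(node_id, level):
    """Return the nested dict for node_id, omitting an empty _children."""
    node = {'id': node_id, 'level': level}
    kids = sorted(children_of.get((node_id, level), ()))
    if kids:
        node['_children'] = [build(k, level + 1) for k in kids]
    return node

# As above, roots are the parents of the level-1 rows (implicit level 0).
root_ids = sorted({parent for parent, _, lvl in rows if lvl == 1})
tree = [build(r, 0) for r in root_ids]
print(json.dumps(tree, indent=3, sort_keys=True))
```

The output has the same shape as before; only the lookup cost per node changes.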
id_to_obj = {}
roots = []
def parse(row):
    pid, cid, lvl = row.parent_id, row.child_id, row.level
    if pid not in id_to_obj:
        # parent not created yet: create parent as root
        id_to_obj[pid] = {
            'id': pid, 'level': 0
        }
        roots.append(pid)
    # Get parent object
    parent = id_to_obj[pid]
    if '_children' not in parent:
        parent['_children'] = []
    # Create child object and add to parent's children list
    child = {
        'id': cid, 'level': lvl
    }
    parent['_children'].append(child)
    id_to_obj[cid] = child
df.sort_values(by='level').apply(parse, axis=1)
# Result
[id_to_obj[pid] for pid in roots]
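The single-pass approach above relies on `df.sort_values(by='level')` so that every parent is seen before its children. Below is a minimal sketch of an order-independent variant, using plain tuples and hypothetical helper names (`get_node`, `child_ids`): nodes are created on first sight, and roots are worked out at the end as the nodes that never appear as a child.

```python
# (parent_id, child_id, level) rows, deliberately NOT sorted by level.
rows = [
    (214263, 131673, 2),
    (125582, 214263, 1),
    (125582, 214659, 1),
]

nodes = {}          # id -> node dict, created the first time an id is seen
child_ids = set()   # every id that appears in the child column

def get_node(node_id):
    """Return the node for node_id, creating it with level 0 if missing."""
    return nodes.setdefault(node_id, {'id': node_id, 'level': 0})

for parent_id, child_id, level in rows:
    child = get_node(child_id)
    child['level'] = level  # the row carries the child's level
    get_node(parent_id).setdefault('_children', []).append(child)
    child_ids.add(child_id)

# Roots are the nodes that never show up as a child; they keep level 0.
roots = [node for node_id, node in nodes.items() if node_id not in child_ids]
```

Because a child that was first created as a placeholder parent is updated in place, its `_children` list survives when its own row arrives later.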