Nested subtotal "All" rows with Pandas pivot_table

Question (votes: 3, answers: 2)

I have some long-format data that looks like this (see below for code to recreate it):

>>> df
   section subsection   name topic     score
0        A          W  zwphf     a  0.802427
1        A          W  jcyyc     a  0.404077
2        A          W  kucem     a  0.367319
3        A          X  ldbxz     a  0.554260
4        A          X  vkcqh     a  0.265864
5        A          X  cvksn     a  0.548099
6        B          Y  spghx     a  0.472612
7        B          Y  cqokn     a  0.577504
8        B          Y  wjsxg     a  0.815309
9        B          Z  holoo     a  0.459850
10       B          Z  lnihf     a  0.667877
11       B          Z  wirhq     a  0.138879
12       A          W  zwphf     b  0.673711
13       A          W  jcyyc     b  0.507962
14       A          W  kucem     b  0.546055
15       A          X  ldbxz     b  0.148214
16       A          X  vkcqh     b  0.773320
17       A          X  cvksn     b  0.791990
18       B          Y  spghx     b  0.487480
19       B          Y  cqokn     b  0.252534
20       B          Y  wjsxg     b  0.237767
21       B          Z  holoo     b  0.432981
22       B          Z  lnihf     b  0.317932
23       B          Z  wirhq     b  0.614401

I want to group by section + subsection + name + topic and unstack on topic, but also show intermediate, nested "All" subtotal rows:

>>> result
   section subsection   name         a         b
0        A        All    All  0.490341  0.573542
1        A          W    All  0.524608  0.575909
2        A          W  jcyyc  0.404077  0.507962
3        A          W  kucem  0.367319  0.546055
4        A          W  zwphf  0.802427  0.673711
5        A          X    All  0.456074  0.571174
6        A          X  cvksn  0.548099  0.791990
7        A          X  ldbxz  0.554260  0.148214
8        A          X  vkcqh  0.265864  0.773320
9        B        All    All  0.522005  0.390516
10       B          Y    All  0.621808  0.325927
11       B          Y  cqokn  0.577504  0.252534
12       B          Y  spghx  0.472612  0.487480
13       B          Y  wjsxg  0.815309  0.237767
14       B          Z    All  0.422202  0.455104
15       B          Z  holoo  0.459850  0.432981
16       B          Z  lnihf  0.667877  0.317932
17       B          Z  wirhq  0.138879  0.614401

It may be easier to see with the new rows highlighted:

[Image: the expected result with the new "All" subtotal rows highlighted]

The initial groupby on its own has no subtotals and looks like this:

>>> df.groupby(['section', 'subsection', 'name', 'topic'])['score'].mean().unstack('topic')
topic                            a         b
section subsection name                     
A       W          jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
                   zwphf  0.802427  0.673711
        X          cvksn  0.548099  0.791990
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
B       Y          cqokn  0.577504  0.252534
                   spghx  0.472612  0.487480
                   wjsxg  0.815309  0.237767
        Z          holoo  0.459850  0.432981
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401

But I'm not sure how to use margins to get the subtotals for the groupby operations on ['section', 'topic'] and ['section', 'subsection', 'topic'].
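For reference, here is a minimal sketch of what `margins=True` gives on its own. The `toy` frame and its values are made up for illustration and are not the df from this question. As far as I can tell, `pivot_table` only adds a single grand-total "All" row and column, not the nested subtotals I am after.

```python
import pandas as pd

# Small stand-in for df; the names and scores are invented for illustration.
toy = pd.DataFrame({
    'section':    ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'subsection': ['W', 'W', 'X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'name':       ['n1', 'n1', 'n2', 'n2', 'n3', 'n3', 'n4', 'n4'],
    'topic':      ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'score':      [0.8, 0.6, 0.4, 0.2, 0.5, 0.3, 0.7, 0.1],
})

# margins=True appends one overall "All" row and one "All" column.
wide = pd.pivot_table(toy,
                      index=['section', 'subsection', 'name'],
                      columns='topic',
                      values='score',
                      aggfunc='mean',
                      margins=True)
print(wide)
```

The result has the four detail rows plus one final "All" row, so there is no per-section or per-subsection subtotal.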


To recreate df:

import pandas as pd
data = [['A', 'W', 'zwphf', 'a', 0.80242702],
        ['A', 'W', 'jcyyc', 'a', 0.40407741],
        ['A', 'W', 'kucem', 'a', 0.36731944],
        ['A', 'X', 'ldbxz', 'a', 0.55426007],
        ['A', 'X', 'vkcqh', 'a', 0.26586396],
        ['A', 'X', 'cvksn', 'a', 0.54809939],
        ['B', 'Y', 'spghx', 'a', 0.47261223],
        ['B', 'Y', 'cqokn', 'a', 0.57750357],
        ['B', 'Y', 'wjsxg', 'a', 0.81530899],
        ['B', 'Z', 'holoo', 'a', 0.45985020],
        ['B', 'Z', 'lnihf', 'a', 0.66787651],
        ['B', 'Z', 'wirhq', 'a', 0.13887864],
        ['A', 'W', 'zwphf', 'b', 0.67371101],
        ['A', 'W', 'jcyyc', 'b', 0.50796174],
        ['A', 'W', 'kucem', 'b', 0.54605544],
        ['A', 'X', 'ldbxz', 'b', 0.14821402],
        ['A', 'X', 'vkcqh', 'b', 0.77331968],
        ['A', 'X', 'cvksn', 'b', 0.79198960],
        ['B', 'Y', 'spghx', 'b', 0.48747995],
        ['B', 'Y', 'cqokn', 'b', 0.25253355],
        ['B', 'Y', 'wjsxg', 'b', 0.23776694],
        ['B', 'Z', 'holoo', 'b', 0.43298050],
        ['B', 'Z', 'lnihf', 'b', 0.31793156],
        ['B', 'Z', 'wirhq', 'b', 0.61440056]]
df = pd.DataFrame(data,
                  columns=['section', 'subsection', 'name', 'topic', 'score'])

To recreate the expected result:

import numpy as np

result = np.array([['A', 'All', 'All', 0.490341219, 0.573541919],
                   ['A', 'W', 'All', 0.52460796, 0.5759094],
                   ['A', 'W', 'jcyyc', 0.404077415, 0.5079617479999999],
                   ['A', 'W', 'kucem', 0.36731944, 0.546055442],
                   ['A', 'W', 'zwphf', 0.8024270240000001, 0.673711011],
                   ['A', 'X', 'All', 0.45607447700000003, 0.571174437],
                   ['A', 'X', 'cvksn', 0.548099391, 0.791989603],
                   ['A', 'X', 'ldbxz', 0.554260074, 0.148214029],
                   ['A', 'X', 'vkcqh', 0.265863967, 0.77331968],
                   ['B', 'All', 'All', 0.5220050279999999, 0.390515513],
                   ['B', 'Y', 'All', 0.621808268, 0.325926816],
                   ['B', 'Y', 'cqokn', 0.577503576, 0.252533557],
                   ['B', 'Y', 'spghx', 0.472612233, 0.487479951],
                   ['B', 'Y', 'wjsxg', 0.815308995, 0.237766941],
                   ['B', 'Z', 'All', 0.42220178799999997, 0.455104209],
                   ['B', 'Z', 'holoo', 0.459850205, 0.43298050200000004],
                   ['B', 'Z', 'lnihf', 0.667876511, 0.317931565],
                   ['B', 'Z', 'wirhq', 0.13887864800000002, 0.61440056]], dtype=object)
result = pd.DataFrame(result, columns=['section', 'subsection', 'name', 'a', 'b'])
python pandas
2 Answers
2
votes

You need:

s = df.groupby(['section', 'subsection', 'name', 'topic'])['score'].mean().unstack('topic')

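# mean(level=...) was removed in pandas 2.0; group on the index level instead.
# Average the detail rows per section and label the other levels 'All'.
s1 = (s.groupby(level=0).mean()
       .assign(subsection = 'All', name='All')
       .set_index(['subsection','name'], append=True))
# Same idea per section + subsection.
s2 = (s.groupby(level=[0, 1]).mean()
       .assign(name='All')
       .set_index(['name'], append=True))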

s = pd.concat([s, s1, s2]).sort_index()

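However, the solution above averages rows that are already means (a mean of means), so I am not sure it is correct if the subtotals should be means of the raw scores. In that case it is better to aggregate from df directly: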

s1 = (df.groupby(['section','topic'])['score'].mean()
        .unstack('topic')
        .assign(subsection = 'All', name='All')
        .set_index(['subsection','name'], append=True))
s2 = (df.groupby(['section','subsection','topic'])['score'].mean()
        .unstack('topic')
        .assign(name='All')
        .set_index(['name'], append=True))

s = pd.concat([s, s1, s2]).sort_index()
print (s)
topic                            a         b
section subsection name                     
A       All        All    0.490341  0.573542
        W          All    0.524608  0.575909
                   jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
                   zwphf  0.802427  0.673711
        X          All    0.456074  0.571174
                   cvksn  0.548099  0.791990
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
B       All        All    0.522005  0.390516
        Y          All    0.621808  0.325927
                   cqokn  0.577504  0.252534
                   spghx  0.472612  0.487480
                   wjsxg  0.815309  0.237767
        Z          All    0.422202  0.455104
                   holoo  0.459850  0.432980
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401
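To illustrate why a mean of means can differ from a mean of the raw scores, here is a minimal sketch on hypothetical, unbalanced data. The `toy` frame is made up for illustration. In the question's data every group has the same size, so both approaches give the same numbers there.

```python
import pandas as pd

# Hypothetical unbalanced data: subsection W has three rows, X has one.
toy = pd.DataFrame({
    'section':    ['A', 'A', 'A', 'A'],
    'subsection': ['W', 'W', 'W', 'X'],
    'score':      [1.0, 1.0, 1.0, 5.0],
})

per_sub = toy.groupby(['section', 'subsection'])['score'].mean()

# Mean of the subsection means: (1.0 + 5.0) / 2
mean_of_means = per_sub.groupby(level='section').mean()

# Mean of the raw scores: (1.0 + 1.0 + 1.0 + 5.0) / 4
pooled_mean = toy.groupby('section')['score'].mean()

print(mean_of_means['A'])  # 3.0
print(pooled_mean['A'])    # 2.0
```

Which of the two is the right subtotal depends on whether each subsection or each raw row should carry equal weight.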

EDIT:

If ordering matters, ordered categoricals can be used, here with tot instead of All:

cat1 = ['tot'] + df['subsection'].unique().tolist()
cat2 = ['tot'] + df['name'].unique().tolist()
df['subsection'] = pd.Categorical(df['subsection'], categories=cat1, ordered=True)
df['name'] = pd.Categorical(df['name'], categories=cat2, ordered=True)

# observed=True keeps only the category combinations that occur in the data
s = (df.groupby(['section', 'subsection', 'name', 'topic'], observed=True)['score']
       .mean()
       .unstack('topic'))

s1 = (df.groupby(['section','topic'])['score'].mean()
        .unstack('topic')
        .assign(subsection = 'tot', name='tot')
        .set_index(['subsection','name'], append=True))
s2 = (df.groupby(['section','subsection','topic'], observed=True)['score'].mean()
        .unstack('topic')
        .assign(name='tot')
        .set_index(['name'], append=True))

s = pd.concat([s, s1, s2]).sort_index()
print (s)
topic                            a         b
section subsection name                     
A       tot        tot    0.490341  0.573542
        W          tot    0.524608  0.575909
                   zwphf  0.802427  0.673711
                   jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
        X          tot    0.456074  0.571174
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
                   cvksn  0.548099  0.791990
B       tot        tot    0.522005  0.390516
        Y          tot    0.621808  0.325927
                   spghx  0.472612  0.487480
                   cqokn  0.577504  0.252534
                   wjsxg  0.815309  0.237767
        Z          tot    0.422202  0.455104
                   holoo  0.459850  0.432980
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401

1
votes

You can do the following: first we create three groupby DataFrames, then we concatenate them and fill the NaN values with All:

df1 = df.groupby(['section', 'subsection', 'name', 'topic']).score.mean().unstack('topic').reset_index()
df2 = df.groupby(['section', 'subsection', 'topic']).score.mean().unstack('topic').reset_index()
df3 = df.groupby(['section', 'topic']).score.mean().unstack('topic').reset_index()

df_final = pd.concat([df1,df2,df3],axis=0, ignore_index=True, sort=True) \
             .fillna('All').sort_values(['section', 'subsection','name']) \
             .reset_index(drop=True)

print(df_final)
           a         b   name section subsection
0   0.490341  0.573542    All       A        All
1   0.524608  0.575909    All       A          W
2   0.404077  0.507962  jcyyc       A          W
3   0.367319  0.546055  kucem       A          W
4   0.802427  0.673711  zwphf       A          W
5   0.456074  0.571174    All       A          X
6   0.548099  0.791990  cvksn       A          X
7   0.554260  0.148214  ldbxz       A          X
8   0.265864  0.773320  vkcqh       A          X
9   0.522005  0.390516    All       B        All
10  0.621808  0.325927    All       B          Y
11  0.577504  0.252534  cqokn       B          Y
12  0.472612  0.487480  spghx       B          Y
13  0.815309  0.237767  wjsxg       B          Y
14  0.422202  0.455104    All       B          Z
15  0.459850  0.432980  holoo       B          Z
16  0.667877  0.317932  lnihf       B          Z
17  0.138879  0.614401  wirhq       B          Z

Reorder the columns to get the correct output:

cols = ['section', 'subsection', 'name', 'a', 'b']
df_final = df_final[cols]

print(df_final)
   section subsection   name         a         b
0        A        All    All  0.490341  0.573542
1        A          W    All  0.524608  0.575909
2        A          W  jcyyc  0.404077  0.507962
3        A          W  kucem  0.367319  0.546055
4        A          W  zwphf  0.802427  0.673711
5        A          X    All  0.456074  0.571174
6        A          X  cvksn  0.548099  0.791990
7        A          X  ldbxz  0.554260  0.148214
8        A          X  vkcqh  0.265864  0.773320
9        B        All    All  0.522005  0.390516
10       B          Y    All  0.621808  0.325927
11       B          Y  cqokn  0.577504  0.252534
12       B          Y  spghx  0.472612  0.487480
13       B          Y  wjsxg  0.815309  0.237767
14       B          Z    All  0.422202  0.455104
15       B          Z  holoo  0.459850  0.432980
16       B          Z  lnihf  0.667877  0.317932
17       B          Z  wirhq  0.138879  0.614401

Note that I use fillna before sort_values to make sure the sort gives the correct output.
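The same pattern can be wrapped in a loop so it works for any number of grouping levels. This is only a sketch: `nested_subtotals` and the `toy` frame are hypothetical names introduced here, not part of pandas or of the question. It also assumes that the subtotal label ("All") sorts before every real key value, which holds for the data in the question.

```python
import pandas as pd

def nested_subtotals(df, keys, column, value, label='All'):
    """Mean of `value` per `keys`, plus subtotal rows for each shorter prefix of `keys`."""
    frames = []
    # depth == len(keys) gives the detail rows; smaller depths give the subtotals.
    for depth in range(1, len(keys) + 1):
        part = (df.groupby(keys[:depth] + [column])[value].mean()
                  .unstack(column)
                  .reset_index())
        # Keys that were not grouped on are filled with the subtotal label.
        for missing in keys[depth:]:
            part[missing] = label
        frames.append(part)
    out = pd.concat(frames, ignore_index=True)
    value_cols = [c for c in out.columns if c not in keys]
    # Key columns first, then sort so each subtotal sits above its detail rows.
    return (out[keys + value_cols]
            .sort_values(keys)
            .reset_index(drop=True))

# Made-up data for illustration only.
toy = pd.DataFrame({
    'section':    ['A', 'A', 'A', 'A', 'A', 'A'],
    'subsection': ['W', 'W', 'W', 'W', 'X', 'X'],
    'name':       ['n1', 'n1', 'n2', 'n2', 'n3', 'n3'],
    'topic':      ['a', 'b', 'a', 'b', 'a', 'b'],
    'score':      [0.8, 0.6, 0.4, 0.2, 0.5, 0.3],
})

out = nested_subtotals(toy, ['section', 'subsection', 'name'], 'topic', 'score')
print(out)
```

Because every level is aggregated from the raw rows, the subtotals are means of the scores themselves, not means of means.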
