How to optimize code for building a pivot table in Python pandas

Question  Votes: 3  Answers: 2

I created a DataFrame in Python pandas that matches companies (A, B, C) to record_ids using four match types (type_1, type_2, type_3 and type_4). It looks like this:

   vendor match_type  record_id    percent  cumulative_percent
0       A     type_1       2974  26.348897           26.348897
1       A     type_2        275   2.436431           28.785328
2       A     type_3        214   1.895987           30.681315
3       A     type_4       2341  20.740675           51.421990
4       B     type_1        440   3.898290           55.320280
5       B     type_2         39   0.345530           55.665810
6       B     type_3         54   0.478427           56.144237
7       B     type_4        596   5.280411           61.424648
8       C     type_1        399   3.535040           64.959688
9       C     type_2         70   0.620183           65.579871
10      C     type_3         44   0.389829           65.969700
11      C     type_4        262   2.321255           68.290954
12    NaN        NaN       3579  31.709046          100.000000

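Where:

  • the record_id column contains the number of matched record_ids
  • row 12 represents the records that did not match any record from companies A, B or C
  • percent is the number of matched record_ids in each row divided by the total number of record_ids
  • cumulative_percent is simply the running total of percent.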
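
For reference, the percent and cumulative_percent definitions above can be sketched as follows. This is a minimal reconstruction, assuming a DataFrame named `counts` (a hypothetical name) that holds only the first three columns of the table:

```python
import numpy as np
import pandas as pd

# Rebuild the counts from the table above; the last row has no vendor or match type.
counts = pd.DataFrame({
    "vendor": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + [np.nan],
    "match_type": ["type_1", "type_2", "type_3", "type_4"] * 3 + [np.nan],
    "record_id": [2974, 275, 214, 2341, 440, 39, 54, 596, 399, 70, 44, 262, 3579],
})

# percent: each row's count divided by the total number of record_ids.
counts["percent"] = counts["record_id"] / counts["record_id"].sum() * 100

# cumulative_percent: running total of percent.
counts["cumulative_percent"] = counts["percent"].cumsum()
```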

I would like the table to look like this:

match_type    type_1  type_2  type_3  type_4  No Match  Grand Total  percent  cumulative percent
vendor                              
  A            2974    275     214     2341              5804          51.4%      51.4%
  B             440     39      54      596              1129          10.0%      61.4%
  C             399     70      44      262               775           6.9%      68.3%
 NaN                                            3579     3579          31.7%     100.0%
Grand Total    3813    384     312     3199     3579    11287         100.0%    

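The problem is that the pivot takes a lot of code. I could not include the percent and cumulative_percent columns in the pivot_table call, so I had to recalculate them. I also had to reorder the columns and rows.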

Can anyone show me how to get this down to fewer lines of Python? Here is the code I wrote to produce the pivot table shown above:

tbl = pd.pivot_table(df, values ="record_id", index ="vendor", columns ="match_type", 
                       aggfunc = np.sum, fill_value="", margins=True, margins_name="Grand Total")
column_order=["type_1", "type_2", "type_3", "type_4", "NaN", "Grand Total"]
tbl = tbl.reindex(column_order, axis=1)
tbl.rename(columns={"NaN":"No Match"}, inplace=True)
row_order = ["A", "B", "C", "NaN", "Grand Total"]
tbl = tbl.reindex(row_order, axis=0)
total=sum(tbl["Grand Total"][0:4])
tbl["percent"]=round(tbl["Grand Total"]/total * 100.0, 1)
tbl["cumulative percent"]=tbl.percent[0:4].cumsum()
tbl.percent=tbl.percent.astype(str) + "%"
tbl["cumulative percent"]=tbl["cumulative percent"].astype(str) + "%"
tbl["cumulative percent"].iloc[4]=""
tbl

Thanks in advance.

python pandas optimization pivot-table
2 Answers
1
vote

Here is another approach using pd.crosstab:

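import numpy as np
import pandas as pd

# Replace missing vendor/match_type values with a placeholder so the
# unmatched row is kept in the crosstab.
df = df.fillna('XXX')
crosstab = pd.crosstab(df['vendor'], 
                       df['match_type'], 
                       df['record_id'], 
                       aggfunc='sum', 
                       margins=True, 
                       margins_name='Grand Total')

# Sum percent per vendor, then build the running total.
piv = crosstab.join(df.groupby('vendor')['percent'].sum())
piv['cumulative_percent'] = piv['percent'].cumsum()
# Restore the labels: placeholder column -> 'No Match', placeholder row -> NaN.
piv = piv.rename(columns={'XXX':'No Match'}).rename(index={'XXX':np.nan}).fillna('')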

            No Match type_1 type_2 type_3 type_4  Grand Total  percent  \
vendor                                                                   
A                      2974    275    214   2341         5804   51.422   
B                       440     39     54    596         1129  10.0027   
C                       399     70     44    262          775  6.86631   
NaN             3579                                     3579   31.709   
Grand Total     3579   3813    384    312   3199        11287            

            cumulative_percent  
vendor                          
A                       51.422  
B                      61.4246  
C                       68.291  
NaN                        100  
Grand Total                    
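
The output above still has No Match as the first column and unformatted percentages. A minimal sketch of one way to get closer to the requested layout follows, assuming the sample data from the question. The names `order`, `share` and `cumulative` are illustrative, and the percentages are derived from the Grand Total column rather than taken from the original percent column:

```python
import numpy as np
import pandas as pd

# Sample data from the question (counts only).
df = pd.DataFrame({
    "vendor": ["A"] * 4 + ["B"] * 4 + ["C"] * 4 + [np.nan],
    "match_type": ["type_1", "type_2", "type_3", "type_4"] * 3 + [np.nan],
    "record_id": [2974, 275, 214, 2341, 440, 39, 54, 596, 399, 70, 44, 262, 3579],
})

# Use a separate placeholder per column so the labels need no renaming later.
filled = df.fillna({"vendor": "NaN", "match_type": "No Match"})

piv = pd.crosstab(filled["vendor"], filled["match_type"],
                  values=filled["record_id"], aggfunc="sum",
                  margins=True, margins_name="Grand Total")

# Put the columns in the requested order.
order = ["type_1", "type_2", "type_3", "type_4", "No Match", "Grand Total"]
piv = piv.reindex(columns=order)

# Share of each vendor in the overall total; the margin row comes out as 100.
share = piv["Grand Total"] / piv.loc["Grand Total", "Grand Total"] * 100
# Running total over the vendor rows only (the margin row is excluded).
cumulative = share.drop("Grand Total").cumsum()

# Format as strings with one decimal; leave the margin row's cumulative cell blank.
piv["percent"] = share.map("{:.1f}%".format)
piv["cumulative percent"] = (cumulative.map("{:.1f}%".format)
                             .reindex(piv.index, fill_value=""))
```

Empty count cells are left as NaN here. Calling fillna('') on them blanks the display, at the cost of turning those columns into object dtype.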

0
votes

Continuing from the key part of your solution:

df = (df.replace(np.nan,'NaN')
      .pivot_table(index='vendor', 
                   values='record_id', 
                   columns='match_type',
                   aggfunc=np.sum,
                   fill_value='',
                   margins=True,
                   margins_name='Grand Total')
     .rename(columns={'NaN':'No Match'})
     .assign(percent = lambda x: x['Grand Total']
                                  .iloc[:4].div(x['Grand Total'].iloc[-1]))
     .assign(percent = lambda x: (x.percent*100).round(1),
             cumulative = lambda x: x.percent.cumsum().fillna(''))
     .filter(items=['type_1',
                    'type_2',
                    'type_3',
                    'type_4',
                    'No Match',
                    'Grand Total',
                    'percent',
                    'cumulative'])
      )

df.loc['Grand Total','percent'] = df['percent'].sum()

df

match_type  type_1 type_2 type_3 type_4 No Match  Grand Total  percent  \
vendor                                                                   
A             2974    275    214   2341                  5804     51.4   
B              440     39     54    596                  1129     10.0   
C              399     70     44    262                   775      6.9   
NaN                                         3579         3579     31.7   
Grand Total   3813    384    312   3199     3579        11287    100.0   

match_type  cumulative  
vendor                  
A                 51.4  
B                 61.4  
C                 68.3  
NaN                100  
Grand Total             
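
Both answers replace the missing values before pivoting (fillna('XXX') in the first, replace(np.nan, 'NaN') in the second). A minimal sketch of why that step matters, on a three-row subset of the data and assuming the default dropna behaviour of pivot_table; the names `dropped` and `kept` are illustrative:

```python
import numpy as np
import pandas as pd

# Three-row subset of the question's data; the last row has NaN keys.
df = pd.DataFrame({
    "vendor": ["A", "A", np.nan],
    "match_type": ["type_1", "type_2", np.nan],
    "record_id": [2974, 275, 3579],
})

# With the NaN keys left in place, the unmatched row is dropped from the pivot.
dropped = df.pivot_table(index="vendor", columns="match_type",
                         values="record_id", aggfunc="sum")

# With a placeholder substituted first, the unmatched row gets its own row and column.
kept = df.fillna("No Match").pivot_table(index="vendor", columns="match_type",
                                         values="record_id", aggfunc="sum")
```

In `dropped` only vendor A remains, while `kept` contains a No Match row and column holding the 3579 unmatched records.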