I've created a DataFrame in Python pandas that matches companies (A, B, C) to record_id using four match types (type_1, type_2, type_3, and type_4). It looks like this:
vendor match_type record_id percent cumulative_percent
0 A type_1 2974 26.348897 26.348897
1 A type_2 275 2.436431 28.785328
2 A type_3 214 1.895987 30.681315
3 A type_4 2341 20.740675 51.421990
4 B type_1 440 3.898290 55.320280
5 B type_2 39 0.345530 55.665810
6 B type_3 54 0.478427 56.144237
7 B type_4 596 5.280411 61.424648
8 C type_1 399 3.535040 64.959688
9 C type_2 70 0.620183 65.579871
10 C type_3 44 0.389829 65.969700
11 C type_4 262 2.321255 68.290954
12 NaN NaN 3579 31.709046 100.000000
I'd like the table to look like this:
match_type   type_1  type_2  type_3  type_4  No Match  Grand Total  percent  cumulative percent
vendor
A              2974     275     214    2341                   5804    51.4%               51.4%
B               440      39      54     596                   1129    10.0%               61.4%
C               399      70      44     262                    775     6.9%               68.3%
NaN                                              3579         3579    31.7%              100.0%
Grand Total    3813     384     312    3199      3579        11287   100.0%
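The problem is that the pivot takes a lot of code. I couldn't include the percent and cumulative_percent columns in the pivot_table call, so I had to recompute them. I also had to reorder the columns and rows.
Can anyone show me how to do this in fewer lines of Python? Here is the code I wrote to get the pivot table shown above: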
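import numpy as np
import pandas as pd

# Pivot record_id by vendor x match_type, with row and column totals.
# (Assumes missing vendor/match_type values are stored as the string "NaN".)
tbl = pd.pivot_table(df, values="record_id", index="vendor", columns="match_type",
                     aggfunc=np.sum, fill_value="", margins=True, margins_name="Grand Total")
# Reorder the columns and relabel the unmatched column.
column_order = ["type_1", "type_2", "type_3", "type_4", "NaN", "Grand Total"]
tbl = tbl.reindex(column_order, axis=1)
tbl.rename(columns={"NaN": "No Match"}, inplace=True)
# Reorder the rows.
row_order = ["A", "B", "C", "NaN", "Grand Total"]
tbl = tbl.reindex(row_order, axis=0)
# Recompute percent and cumulative percent from the first four rows.
total = sum(tbl["Grand Total"][0:4])
tbl["percent"] = round(tbl["Grand Total"] / total * 100.0, 1)
tbl["cumulative percent"] = tbl.percent[0:4].cumsum()
# Format both columns as strings and blank the cumulative total cell.
tbl.percent = tbl.percent.astype(str) + "%"
tbl["cumulative percent"] = tbl["cumulative percent"].astype(str) + "%"
tbl.loc["Grand Total", "cumulative percent"] = ""
tbl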
Thanks in advance.
Here is another approach using pd.crosstab:
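import numpy as np
import pandas as pd

# 'XXX' is a placeholder so rows with missing labels are not dropped.
df = df.fillna('XXX')
crosstab = pd.crosstab(df['vendor'],
                       df['match_type'],
                       values=df['record_id'],
                       aggfunc='sum',
                       margins=True,
                       margins_name='Grand Total')
# Reuse the existing percent column: sum it per vendor, then accumulate.
piv = crosstab.join(df.groupby('vendor')['percent'].sum())
piv['cumulative_percent'] = piv['percent'].cumsum()
# Restore the labels and blank out the empty cells.
piv = piv.rename(columns={'XXX': 'No Match'}).rename(index={'XXX': np.nan}).fillna('')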
Output:
             No Match  type_1  type_2  type_3  type_4  Grand Total  percent  cumulative_percent
vendor
A                        2974     275     214    2341         5804   51.422              51.422
B                         440      39      54     596         1129  10.0027             61.4246
C                         399      70      44     262          775  6.86631              68.291
NaN              3579                                         3579   31.709                 100
Grand Total      3579    3813     384     312    3199        11287
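For anyone who wants to run the crosstab approach end to end, here is a minimal reproducible sketch. It rebuilds the sample frame from the counts shown in the question and also formats the two percentage columns as strings, which the question asked for. `make_sample` is a hypothetical helper introduced only for this illustration, and relabelling the unmatched row as "No Match" (rather than NaN) is an assumption made to keep the index easy to look up.

```python
import numpy as np
import pandas as pd


def make_sample() -> pd.DataFrame:
    """Hypothetical helper: rebuild the question's frame from its counts."""
    counts = {"A": [2974, 275, 214, 2341],
              "B": [440, 39, 54, 596],
              "C": [399, 70, 44, 262]}
    rows = [(vendor, f"type_{i}", n)
            for vendor, ns in counts.items()
            for i, n in enumerate(ns, start=1)]
    rows.append((np.nan, np.nan, 3579))  # unmatched records
    df = pd.DataFrame(rows, columns=["vendor", "match_type", "record_id"])
    df["percent"] = df["record_id"] / df["record_id"].sum() * 100
    return df


# "XXX" is a placeholder so the rows with missing labels are not dropped.
df = make_sample().fillna("XXX")
crosstab = pd.crosstab(df["vendor"], df["match_type"],
                       values=df["record_id"], aggfunc="sum",
                       margins=True, margins_name="Grand Total")
piv = crosstab.join(df.groupby("vendor")["percent"].sum())
piv["cumulative_percent"] = piv["percent"].cumsum()

# Format the percentage columns as strings; missing cells become "".
for col in ["percent", "cumulative_percent"]:
    piv[col] = piv[col].map(lambda p: "" if pd.isna(p) else f"{p:.1f}%")

# Assumption: label the unmatched row and column "No Match".
piv = piv.rename(columns={"XXX": "No Match"}, index={"XXX": "No Match"})
print(piv)
```

Formatting is done last because the string columns can no longer be summed or accumulated.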
Building on your own solution:
import numpy as np
import pandas as pd

df = (df.replace(np.nan, 'NaN')  # make the missing labels groupable
        .pivot_table(index='vendor',
                     values='record_id',
                     columns='match_type',
                     aggfunc='sum',
                     fill_value='',
                     margins=True,
                     margins_name='Grand Total')
        .rename(columns={'NaN': 'No Match'})
        # Share of the overall total for the four vendor rows.
        .assign(percent=lambda x: x['Grand Total']
                .iloc[:4].div(x['Grand Total'].iloc[-1]))
        .assign(percent=lambda x: (x.percent * 100).round(1),
                cumulative=lambda x: x.percent.cumsum().fillna(''))
        # filter() also puts the columns in the desired order.
        .filter(items=['type_1',
                       'type_2',
                       'type_3',
                       'type_4',
                       'No Match',
                       'Grand Total',
                       'percent',
                       'cumulative'])
     )
# Fill in the total for the percent column.
df.loc['Grand Total', 'percent'] = df['percent'].sum()
df
match_type   type_1  type_2  type_3  type_4  No Match  Grand Total  percent  cumulative
vendor
A              2974     275     214    2341                   5804     51.4        51.4
B               440      39      54     596                   1129     10.0        61.4
C               399      70      44     262                    775      6.9        68.3
NaN                                              3579         3579     31.7         100
Grand Total    3813     384     312    3199      3579        11287    100.0
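One detail worth checking in the recomputation step: the cumulative column above is the running sum of already-rounded percentages, which happens to work for this data but can drift by 0.1 in general. A small sketch of the alternative, assuming only the row totals from the table above, accumulates the raw counts first and rounds at the end:

```python
import pandas as pd

# Row totals copied from the "Grand Total" column of the table above.
totals = pd.Series({"A": 5804, "B": 1129, "C": 775, "NaN": 3579},
                   name="Grand Total")

# Percent of the overall total, rounded only for display.
percent = (totals / totals.sum() * 100).round(1)

# Accumulate the raw counts, then convert and round once,
# so rounding error does not build up row by row.
cumulative = (totals.cumsum() / totals.sum() * 100).round(1)

summary = pd.DataFrame({"percent": percent, "cumulative": cumulative})
print(summary)
```

For this data both approaches give the same figures; the difference only shows up when the rounded percentages do not add up to exactly 100.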