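I have a function that takes a DataFrame with data on life-expectancy articles for different regions and countries. I want to compute each region's share of all articles and, within each region, the share of articles about males and about females. My question: how can I replace the for loop that builds the small DataFrame inside the function calc_proportion? The function takes every unique region in the DataFrame and calculates the proportions for each one.
This is the kind of DataFrame I want to get from calc_proportion.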
def calc_proportion(df):
    proportions = pd.DataFrame(columns=['Region', 'Proportion_of_all_articles', 'Proportion_male_articles', 'Proportion_female_articles', 'Proportion_bs_articles'])
    Regions = df.Region.unique()
    for region in Regions:
        a = f"{df.loc[df['Region'] == region].shape[0] / df.shape[0] : .0%}"
        b = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Male')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        c = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Female')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        d = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Both sexes')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        proportions.loc[len(proportions)] = [region, a, b, c, d]
    return proportions

calc_proportion(df)
So I want to get the small DataFrame of proportions shown in "out", without using a for loop in the function.
import pandas as pd
import numpy as np
np.random.seed(0) # for reproducibility
regions = ['Africa', 'Americas', 'Eastern Mediterranean', 'Europe', 'South_East Asia']
sexes = ['Male', 'Female', 'Both sexes']
data = {'Region': np.random.choice(regions, 15),
        'Sex': np.random.choice(sexes, 15)}
df = pd.DataFrame(data)
df
Region Sex
0 South_East Asia Female
1 Africa Female
2 Europe Female
3 Europe Female
4 Europe Male
5 Americas Female
6 Europe Male
7 Eastern Mediterranean Male
8 South_East Asia Female
9 Africa Both sexes
10 Africa Male
11 South_East Asia Both sexes
12 Eastern Mediterranean Male
13 Americas Female
14 Africa Female
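Here is one approach:
Use df.groupby and apply groupby.value_counts with the normalize parameter set to True to get the distribution of "Sex" within each region.
Apply unstack to pivot the second index level (the one holding "Sex") into columns.
Get each region's share of all rows by applying value_counts to df["Region"] (Series.value_counts), and use df.join to join the two results.
Chain df.fillna to fill the NaN values with 0.
Chain df.rename to change the column names.
Chain df.loc to get the columns in the desired order, and reset the index with df.reset_index.
Code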
# dict for renaming col names at end
cols_rename = {'Region': 'Proportion_of_all_articles',
               'Male': 'Proportion_male_articles',
               'Female': 'Proportion_female_articles',
               'Both sexes': 'Proportion_bs_articles'}

out = (df.groupby('Region')['Sex']
       .value_counts(normalize=True)
       .unstack('Sex')
       .join(
           # pandas >= 2.0 names this Series 'proportion'; rename it so
           # the 'Region' key in cols_rename matches on every version
           df['Region'].value_counts(normalize=True).rename('Region')
           )
       .fillna(0)
       .rename(columns=cols_rename)
       .loc[:, list(cols_rename.values())]
       .reset_index(drop=False)
       )
Result
out
Region Proportion_of_all_articles \
0 Africa 0.266667
1 Americas 0.133333
2 Eastern Mediterranean 0.133333
3 Europe 0.266667
4 South_East Asia 0.200000
Proportion_male_articles Proportion_female_articles \
0 0.25 0.500000
1 0.00 1.000000
2 1.00 0.000000
3 0.50 0.500000
4 0.00 0.666667
Proportion_bs_articles
0 0.250000
1 0.000000
2 0.000000
3 0.000000
4 0.333333
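As an alternative sketch (not part of the approach above), the per-region sex shares can also be built with pd.crosstab and normalize='index', which divides each row by its own total. The small sample frame and the names sex_cols, by_sex, overall and out_ct are made up for illustration.

```python
import pandas as pd

# Hypothetical hand-made sample, not the random frame from the question
df = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Europe', 'Europe', 'Europe', 'Americas'],
    'Sex': ['Male', 'Female', 'Female', 'Both sexes', 'Female', 'Male'],
})

sex_cols = {'Male': 'Proportion_male_articles',
            'Female': 'Proportion_female_articles',
            'Both sexes': 'Proportion_bs_articles'}

# Row-normalised contingency table: every region's row sums to 1
by_sex = (pd.crosstab(df['Region'], df['Sex'], normalize='index')
            .reindex(columns=list(sex_cols), fill_value=0)  # fixed order, keep absent categories
            .rename(columns=sex_cols))

# Each region's share of all rows
overall = (df['Region'].value_counts(normalize=True)
             .rename('Proportion_of_all_articles'))

# Align both pieces on the region index and turn it back into a column
out_ct = (pd.concat([overall, by_sex], axis=1)
            .rename_axis(index='Region', columns=None)
            .sort_index()
            .reset_index())

print(out_ct)
```

The reindex step is there so that a category missing from the data (say, no 'Both sexes' rows at all) still shows up as a column of zeros instead of raising a KeyError later.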
Formatting the result
Since you appear to be working in a Jupyter Notebook, I would suggest using df.style.format to display the floats as percentages:
out.style.format({
    col: lambda x: "{: .0f}%".format(x*100) for col in out.columns if 'Proportion' in col
})
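If percent strings are needed in the DataFrame itself (as in the original calc_proportion) rather than only in the notebook display, something along these lines might work. This is a minimal sketch: the function name calc_proportion_vectorized and the demo frame are hypothetical, and the result columns hold strings, so they can no longer be used for arithmetic.

```python
import pandas as pd

def calc_proportion_vectorized(df):
    """Loop-free sketch: per-region shares formatted as percent strings."""
    sex_cols = {'Male': 'Proportion_male_articles',
                'Female': 'Proportion_female_articles',
                'Both sexes': 'Proportion_bs_articles'}

    # Share of each sex within a region, one column per sex
    by_sex = (df.groupby('Region')['Sex']
                .value_counts(normalize=True)
                .unstack('Sex', fill_value=0)
                .reindex(columns=list(sex_cols), fill_value=0)
                .rename(columns=sex_cols))

    # Share of each region among all rows, named explicitly so the
    # result does not depend on the pandas version's default name
    overall = (df['Region'].value_counts(normalize=True)
                 .rename('Proportion_of_all_articles'))

    out = by_sex.join(overall)
    prop_cols = ['Proportion_of_all_articles'] + list(sex_cols.values())
    out = out.loc[:, prop_cols].reset_index()

    # Turn every proportion into a string such as '25%'
    out[prop_cols] = out[prop_cols].apply(lambda s: s.map('{:.0%}'.format))
    return out

# Hypothetical demo data
demo = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Europe'],
    'Sex': ['Male', 'Female', 'Female', 'Both sexes', 'Male'],
})
result = calc_proportion_vectorized(demo)
print(result)
```

Keeping the numeric out and formatting only at display time (as with style.format) is usually the more flexible choice; the string version is mainly useful when exporting to plain text.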