How to replace a for loop in a pandas DataFrame?

Question

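I have this function, which takes a DataFrame containing data on articles about life expectancy for different regions and countries. For each region, I want to compute its share of all articles, as well as the share of that region's articles that are about males and about females. My question is: how can I replace the for loop that builds the small DataFrame in calc_proportion? The function takes all unique regions in the DataFrame and computes the proportions for each one.

This is the kind of DataFrame I want to get from calc_proportion.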

def calc_proportion(df):
    proportions = pd.DataFrame(columns=['Region', 'Proportion_of_all_articles', 'Proportion_male_articles', 'Proportion_female_articles', 'Proportion_bs_articles'])
    Regions = df.Region.unique()
    for region in Regions:
        a = f"{df.loc[df['Region'] == region].shape[0] / df.shape[0] : .0%}"
        b = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Male')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        c = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Female')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        d = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Both sexes')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        proportions.loc[len(proportions)] = [region, a, b, c, d]
    return proportions

calc_proportion(df)

So I would like to get the small DataFrame of proportions shown in the output, without using a for loop in the function.

Initial data:

Answer 1

Minimal reproducible example

import pandas as pd
import numpy as np

np.random.seed(0) # for reproducibility
regions = ['Africa', 'Americas', 'Eastern Mediterranean', 'Europe', 'South_East Asia']
sexes = ['Male', 'Female', 'Both sexes']

data = {'Region': np.random.choice(regions, 15),
        'Sex': np.random.choice(sexes, 15)}

df = pd.DataFrame(data)

df

                   Region         Sex
0         South_East Asia      Female
1                  Africa      Female
2                  Europe      Female
3                  Europe      Female
4                  Europe        Male
5                Americas      Female
6                  Europe        Male
7   Eastern Mediterranean        Male
8         South_East Asia      Female
9                  Africa  Both sexes
10                 Africa        Male
11        South_East Asia  Both sexes
12  Eastern Mediterranean        Male
13               Americas      Female
14                 Africa      Female

Here is one approach:

Use `df.groupby` on "Region" and apply `groupby.value_counts` with the `normalize` parameter set to `True` to get the distribution of "Sex" within each region.
Next, use `df.unstack` to pivot the second index level (the one holding "Sex") into columns.
For "Proportion_of_all_articles", apply the same value_counts directly to `df["Region"]` (`Series.value_counts`), then use `df.join` to combine the two results.
The rest is cosmetic:
- Add `df.fillna` to replace `NaN` values with `0`.
- Add `df.rename` to change the column names.
- Use `df.loc` to get the columns in the desired order, and `df.reset_index` to reset the index.

Code

# dict for renaming col names at end
cols_rename = {'Region': 'Proportion_of_all_articles',
               'Male': 'Proportion_male_articles',
               'Female': 'Proportion_female_articles',
               'Both sexes': 'Proportion_bs_articles'}

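out = (df.groupby('Region')['Sex']
       .value_counts(normalize=True)
       .unstack('Sex')
       .join(
           # name the Series 'Region' explicitly: pandas >= 2.0 would
           # otherwise call it 'proportion' and the rename below would miss it
           df['Region'].value_counts(normalize=True).rename('Region')
           )
       .fillna(0)
       .rename(columns=cols_rename)
       .loc[:, list(cols_rename.values())]
       .reset_index(drop=False)
       )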

Result

out

                  Region  Proportion_of_all_articles  \
0                 Africa                    0.266667   
1               Americas                    0.133333   
2  Eastern Mediterranean                    0.133333   
3                 Europe                    0.266667   
4        South_East Asia                    0.200000   

   Proportion_male_articles  Proportion_female_articles  \
0                      0.25                    0.500000   
1                      0.00                    1.000000   
2                      1.00                    0.000000   
3                      0.50                    0.500000   
4                      0.00                    0.666667   

   Proportion_bs_articles  
0                0.250000  
1                0.000000  
2                0.000000  
3                0.000000  
4                0.333333
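
As a possible alternative to the groupby chain, the same table can be sketched with `pd.crosstab` and `normalize="index"`, which fills missing Region/Sex combinations with 0 on its own. This is a different technique from the one in the answer, and the `demo` frame and the names `by_sex`, `share` and `alt` are made up for illustration:

```python
import pandas as pd

# Small hand-made frame (hypothetical data, not the asker's)
demo = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Africa", "Europe", "Europe"],
    "Sex": ["Male", "Female", "Female", "Both sexes", "Male", "Female"],
})

# Row-normalised contingency table: each row (region) sums to 1
by_sex = pd.crosstab(demo["Region"], demo["Sex"], normalize="index")

# Share of all rows that belong to each region
share = (demo["Region"].value_counts(normalize=True)
         .rename("Proportion_of_all_articles"))

col_order = ["Region",
             "Proportion_of_all_articles",
             "Proportion_male_articles",
             "Proportion_female_articles",
             "Proportion_bs_articles"]

alt = (by_sex
       .rename(columns={"Male": "Proportion_male_articles",
                        "Female": "Proportion_female_articles",
                        "Both sexes": "Proportion_bs_articles"})
       .join(share)          # aligned on the Region index
       .reset_index()
       [col_order])
alt.columns.name = None      # drop the leftover "Sex" label on the columns

print(alt)
```

Note that this sketch assumes every value of "Sex" occurs at least once in the data; a category that is absent everywhere would not get a column from `pd.crosstab`.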

Formatting the result

Since you appear to be working in a Jupyter Notebook, I suggest using `df.style.format` to display the floats as percentages:

out.style.format({
    col: lambda x: "{: .0f}%".format(x*100) for col in out.columns if 'Proportion' in col
})
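
If the goal is a drop-in replacement for the original `calc_proportion`, returning percentage strings rather than a styled view, the steps could be wrapped roughly as follows. This is only a sketch under a few assumptions: the three "Sex" labels are the ones from the question, the `demo` frame is invented for illustration, and the values are formatted with `"{:.0%}"` (the question's `: .0%` would add a leading space):

```python
import pandas as pd

def calc_proportion(df):
    """Loop-free sketch: per-region shares as percentage strings."""
    sex_cols = {"Male": "Proportion_male_articles",
                "Female": "Proportion_female_articles",
                "Both sexes": "Proportion_bs_articles"}

    # Share of each Sex within each Region (each row sums to 1)
    by_sex = (df.groupby("Region")["Sex"]
                .value_counts(normalize=True)
                .unstack("Sex", fill_value=0.0)
                # guarantee all three columns, even if a label never occurs
                .reindex(columns=list(sex_cols), fill_value=0.0)
                .rename(columns=sex_cols))

    # Share of each Region among all rows
    share = (df["Region"].value_counts(normalize=True)
               .rename("Proportion_of_all_articles"))

    out = by_sex.join(share)[["Proportion_of_all_articles", *sex_cols.values()]]
    out.columns.name = None

    # Format every proportion column as a whole-number percentage string
    return out.apply(lambda col: col.map("{:.0%}".format)).reset_index()

# Hypothetical data for a quick check
demo = pd.DataFrame({
    "Region": ["Africa", "Africa", "Africa", "Africa", "Europe", "Europe"],
    "Sex": ["Male", "Female", "Female", "Both sexes", "Male", "Female"],
})
result = calc_proportion(demo)
print(result)
```

Unlike the loop version, which lists regions in order of first appearance, this sketch returns them sorted by name, because `groupby` sorts its keys by default.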
