Calculating the percentage of each category in Pandas

Question  Votes: 1  Answers: 3

I have a DataFrame train, and I filtered a subset of its rows to form the promoted DataFrame:

print(train.department.value_counts(),'\n')
promoted=train[train.is_promoted==1]
print(promoted.department.value_counts())

The output of the code above is:

Sales & Marketing    16840
Operations           11348
Technology            7138
Procurement           7138
Analytics             5352
Finance               2536
HR                    2418
Legal                 1039
R&D                    999
Name: department, dtype: int64

Sales & Marketing    1213
Operations           1023
Technology            768
Procurement           688
Analytics             512
Finance               206
HR                    136
R&D                    69
Legal                  53
Name: department, dtype: int64

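For each department, I want to show the percentage of its rows in train that also appear in promoted. That is, instead of the counts 1213, 1023, 768, 688 and so on, I should get a percentage such as 1213 / 16840 * 100 = 7.2. Note that I do not want normalized values.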

python pandas series
3 Answers
1
vote

Try:

promoted.department.value_counts()/train.department.value_counts()*100

It should give you the desired output:

Sales & Marketing     7.2030
Operations            9.0148
Technology           10.7593
.....                 ...
Name: department, dtype: float64
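
As a minimal sketch of why this division works: pandas aligns the two Series on their index labels before dividing, so it does not matter that the two value_counts() results list the departments in a different order. The toy data below is hypothetical and only stands in for the question's train frame. The groupby line is an alternative that was not part of the answer, and it assumes is_promoted contains only 0 and 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's `train` frame.
train = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR", "Legal", "Legal"],
    "is_promoted": [1, 0, 0, 0, 1, 1, 0, 1],
})

promoted = train[train.is_promoted == 1]

# Division matches rows by index label (the department name), not by position.
pct = promoted.department.value_counts() / train.department.value_counts() * 100

# Alternative, assuming is_promoted is strictly 0/1: the group mean is the promoted share.
pct_mean = train.groupby("department")["is_promoted"].mean() * 100

print(pct.round(2))
print(pct_mean.round(2))
```

Both results are float Series indexed by department, so they can be rounded or sorted with the usual Series methods.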

1
vote

How about this? The example uses a toy dataset, but the key idea is simply to divide one value count by the other.

import pandas as pd
import numpy as np

data = pd.DataFrame({
    'department': list(range(10)) * 100,
    'is_promoted': np.random.randint(0, 2, size =  1000)
})

# Slice out promoted data.

data_promoted = data[data['is_promoted'] == 1]

# Calculate share of each department that is present in data_promoted.

data_promoted['department'].value_counts().sort_index() / data['department'].value_counts().sort_index()

This gives:

0    0.50
1    0.52
2    0.45
3    0.54
4    0.41
5    0.50
6    0.45
7    0.52
8    0.60
9    0.52
Name: department, dtype: float64
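
One edge case the toy dataset does not hit: a department with no promoted rows is missing from the promoted counts, so plain division leaves NaN for it. The sketch below uses hypothetical data and Series.div with fill_value=0 to report such departments as 0 instead. Whether 0 or NaN is the right answer depends on what you do with the result.

```python
import pandas as pd

# Hypothetical data: department "C" has no promoted rows at all.
data = pd.DataFrame({
    "department": ["A", "A", "B", "B", "C", "C"],
    "is_promoted": [1, 0, 1, 1, 0, 0],
})

total = data["department"].value_counts()
promoted = data.loc[data["is_promoted"] == 1, "department"].value_counts()

# Plain division: "C" is absent from `promoted`, so its share is NaN.
plain = promoted / total

# div(..., fill_value=0) treats the missing promoted count as 0 before dividing.
share = promoted.div(total, fill_value=0)

print(plain)
print(share)
```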

0
votes
import pandas as pd
df = pd.read_csv("/home/spaceman/my_work/Most-Recent-Cohorts-Scorecard-Elements.csv")
df = df[['STABBR']]  # each value appears in the dataframe multiple times
# after that I got
CA    717
TX    454
NY    454
FL    417
PA    382
OH    320
IL    280
MI    189
NC    189
.........
.........

print(df['STABBR'].value_counts(normalize=True))  # returns the relative frequency by dividing all values by the sum of values
CA    0.099930
TX    0.063275
NY    0.063275
FL    0.058118
PA    0.053240
OH    0.044599
IL    0.039024
MI    0.026341
NC    0.026341
..............
..............
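
Note that value_counts(normalize=True) answers a different question from the one asked: it gives each category's share of all rows, not the promoted share within each category. The sketch below contrasts the two on hypothetical toy data. The pd.crosstab call with normalize="index" is one possible way to get the per-department rate and was not part of this answer.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's `train` frame.
train = pd.DataFrame({
    "department": ["Sales", "Sales", "Sales", "Sales", "HR", "HR", "Legal", "Legal"],
    "is_promoted": [1, 0, 0, 0, 1, 1, 0, 1],
})
promoted = train[train["is_promoted"] == 1]

# Share of all promotions that each department accounts for (sums to 100).
share_of_promotions = promoted["department"].value_counts(normalize=True) * 100

# Promotion rate within each department: normalize each row of the crosstab,
# then take the column labelled 1 (is_promoted == 1).
rate = pd.crosstab(train["department"], train["is_promoted"], normalize="index")[1] * 100

print(share_of_promotions)
print(rate)
```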