Suppose we have a DataFrame:
import pandas as pd
data = {'person_id': ['person_a', 'person_a', 'person_b', 'person_b', 'person_c', 'person_c'],
        'categorical_data': ['new', 'new', 'ok', 'bad', 'new', 'bad']}
df = pd.DataFrame(data)
  person_id categorical_data
0  person_a              new
1  person_a              new
2  person_b               ok
3  person_b              bad
4  person_c              new
5  person_c              bad
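I want to expand the categorical data into multiple columns, one per category, each holding the count for that category.
I can group by person ID to get the counts: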
count_categories = df.groupby('person_id')['categorical_data'].value_counts().reset_index(name='count')
  person_id categorical_data  count
0  person_a              new      2
1  person_b              bad      1
2  person_b               ok      1
3  person_c              bad      1
4  person_c              new      1
Then I tried this to create the new columns:
pivoted = count_categories.set_index(['person_id','categorical_data']).unstack('categorical_data')
                 count
categorical_data   bad  new   ok
person_id
person_a           NaN  2.0  NaN
person_b           1.0  NaN  1.0
person_c           1.0  1.0  NaN
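This is the shape I want, but I am confused by the MultiIndex on the columns.
How can I get rid of it, or is there a better way to do this? Trying to reset the index yields: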
pivoted.reset_index()
                 person_id count
categorical_data             bad  new   ok
0                 person_a   NaN  2.0  NaN
1                 person_b   1.0  NaN  1.0
2                 person_c   1.0  1.0  NaN
Code
Use crosstab:
out = pd.crosstab(df['person_id'], df['categorical_data'])
out:
categorical_data  bad  new  ok
person_id
person_a            0    2   0
person_b            1    0   1
person_c            1    1   0
Or, to turn person_id back into a regular column and drop the name on the columns axis:
out1 = (pd.crosstab(df['person_id'], df['categorical_data'])
.reset_index()
.rename_axis(None, axis=1)
)
out1:
  person_id  bad  new  ok
0  person_a    0    2   0
1  person_b    1    0   1
2  person_c    1    1   0
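
If you would rather stay close to the groupby approach from the question, the following is a minimal sketch of two ways that should give the same flat table. The names counts, wide, flat and flat_out are mine and only illustrative. The first variant unstacks the value_counts Series directly, so no extra 'count' level is ever created on the columns. The second keeps the original pivot and drops the outer column level afterwards.

```python
import pandas as pd

data = {
    'person_id': ['person_a', 'person_a', 'person_b', 'person_b', 'person_c', 'person_c'],
    'categorical_data': ['new', 'new', 'ok', 'bad', 'new', 'bad'],
}
df = pd.DataFrame(data)

# Variant 1: unstack the Series of counts directly.
# fill_value=0 replaces missing combinations with 0 instead of NaN,
# so the counts stay integers.
counts = df.groupby('person_id')['categorical_data'].value_counts()
wide = counts.unstack(fill_value=0)

# Variant 2: keep the pivot from the question, then drop the outer
# 'count' level of the column MultiIndex and fill the gaps.
count_categories = counts.reset_index(name='count')
pivoted = (count_categories
           .set_index(['person_id', 'categorical_data'])
           .unstack('categorical_data'))
flat = pivoted.droplevel(0, axis=1).fillna(0).astype(int)

# Same final clean-up as in the crosstab answer above.
flat_out = wide.reset_index().rename_axis(None, axis=1)
print(flat_out)
```

The MultiIndex in the question appears because unstack is called on a DataFrame that still has a 'count' column, so that column name becomes the outer level. For plain counting, crosstab remains the shorter option.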