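I need some help performing a few operations on subgroups, and I'm really confused. I'll try to describe the operations and the desired output briefly, with comments.
(1) Compute the percentage frequency of occurrences within each subgroup.
(2) Show records that don't exist, with a value of 0.
(3) Rearrange the order of the records and columns.
Suppose the df below is the raw data.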
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind, but I can't get to the desired output.
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products      accessories  bags  clothes  shoes
store branch
1     A               0.0   0.0      1.0    1.0
      C               0.0   0.0      1.0    0.0
2     C               1.0   0.0      1.0    1.0
3     A               0.0   2.0      1.0    0.0
      C               0.0   0.0      1.0    0.0
# desired output: if (1), (2) and (3) take place somehow...
products      clothes  shoes  accessories  bags
store branch
1     B             0      0            0     0  # store 1 has 3 records in total (clothes and shoes in A, clothes in C), so each count of 1 becomes 33.3%
      A          33.3   33.3            0     0
      C          33.3    0.0            0     0
2     B             0      0            0     0
      A             0      0            0     0
      C          33.3   33.3         33.3     0
3     B             0      0            0     0  # store 3 has 4 records in total (2 bags and 1 clothes in A, 1 clothes in C), so the 2 bags become 50% and so on
      A            25      0            0    50
      C            25      0            0     0
# (3) columns rearranged, with "clothes" and "shoes" going first
# (3)+(2) branch B appears and the order of branches changes to B, A, C
# (1) percentages of the occurrences are computed per store, as hopefully explained by the comments above
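I tried handling each group separately, but (i) it doesn't take the replaced NaN values into account, and (ii) I should avoid processing group by group, because I would have to concatenate many groups afterwards (this df is just an example) in order to plot all of them together later.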
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products      accessories  bags  clothes  shoes
store branch
1     A               NaN   NaN     50.0  100.0  # why has it transformed on axis='columns'?
      C               NaN   NaN     50.0    0.0
I hope my question makes sense. Any insight into what I'm trying to do would be greatly appreciated. Thanks a lot in advance!
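On the question in the comment above: `DataFrame.transform` passes each column to the function by default (`axis=0`), so `sum(x)` is a column total, not the total of the whole store. The sketch below reproduces that on the store-1 block and contrasts it with dividing by the block's grand total. The names `counts`, `block`, `per_column` and `per_store` are illustrative only and do not appear in the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

# Count table with missing (store, branch, product) combinations filled with 0.
counts = (df.groupby(['store', 'branch', 'products']).size()
            .unstack('products', fill_value=0))

block = counts.loc[[1]]  # rows of store 1 only

# transform hands the function one column at a time, so col.sum() is the
# column total. Columns that sum to 0 give 0/0, which is NaN.
per_column = block.transform(lambda col: col * 100 / col.sum())

# Dividing by the grand total of the block gives the share within the store.
per_store = block * 100 / block.to_numpy().sum()

print(per_column)
print(per_store.round(1))
```

This only covers a single store, so it still has the group-by-group problem described in the question. It is meant to show where the column-wise behaviour and the NaN values come from.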
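With the help of @Quang Hoang, who tried to help with this problem the day before I posted this answer, I managed to find a solution.
To explain the last bit of the calculation: I transformed each element by dividing it by the sum of counts of its level-0 group (the store), rather than by the row, column, or overall total, to get the frequency of each element.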
grouped_df = (
    df.groupby(['store', 'branch', 'products']).size()
      .unstack('branch')
      .reindex(['B', 'C', 'A'], axis=1, fill_value=0)   # add branch B as a column of zeros
      .stack('branch')
      .unstack('products')
      .replace({np.nan: 0})
      .transform(lambda x: x*100/df.groupby(['store']).size())  # divide by the record count of each store
      .round(1)
      .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
)
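Running the code above produces the desired values. (The printout below still lists the columns alphabetically; the final reindex puts clothes and shoes first.)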
products      accessories  bags  clothes  shoes
store branch
1     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     33.3    0.0
      A               0.0   0.0     33.3   33.3
2     B               0.0   0.0      0.0    0.0
      C              33.3   0.0     33.3   33.3
3     B               0.0   0.0      0.0    0.0
      C               0.0   0.0     25.0    0.0
      A               0.0  50.0     25.0    0.0
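The printout above has no row for store 2, branch A, and the branches come out as B, C, A, whereas the desired output in the question lists every store with branches B, A, C. One possible alternative, sketched here under the assumption that every store should show the same fixed list of branches, is to reindex the count table against a full store × branch grid built with `pd.MultiIndex.from_product`. The names `branch_order`, `product_order`, `full_index`, `store_totals` and `pct` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
                   'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
                   'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                                'clothes', 'bags', 'bags', 'clothes', 'clothes']})

# Assumed display orders, taken from the desired output in the question.
branch_order = ['B', 'A', 'C']
product_order = ['clothes', 'shoes', 'accessories', 'bags']

# Counts per (store, branch) row and product column.
counts = (df.groupby(['store', 'branch', 'products']).size()
            .unstack('products', fill_value=0))

# Every store paired with every branch, in the requested order.
full_index = pd.MultiIndex.from_product(
    [sorted(df['store'].unique()), branch_order],
    names=['store', 'branch'])

# Missing rows (e.g. branch B, or store 2 / branch A) are added as zeros.
counts = counts.reindex(index=full_index, columns=product_order, fill_value=0)

# Total number of records per store, repeated on every row of that store.
store_totals = counts.sum(axis=1).groupby(level='store').transform('sum')

# Share of each cell within its store, in percent.
pct = counts.div(store_totals, axis=0).mul(100).round(1)
print(pct)
```

Because the whole table is handled in one pass, no per-store concatenation is needed before plotting.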