pandas groupby: computing frequencies within subgroups, inserting new rows, and rearranging columns

Question

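I need some help performing a few operations on subgroups, and I am quite confused. I will try to describe the operations briefly and show the desired output with comments.

(1) Compute the percentage frequency of occurrences within each subgroup.

(2) Show records that do not exist, with a value of 0.

(3) Rearrange the order of the records and columns.

Assume the df below is the raw data.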

import numpy as np
import pandas as pd

df=pd.DataFrame({'store':[1,1,1,2,2,2,3,3,3,3],
                 'branch':['A','A','C','C','C','C','A','A','C','A'],
                 'products':['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})

The grouped_df below is close to what I have in mind, but I cannot get the desired output from it.

grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})

# output:
products      accessories  bags  clothes  shoes
store branch                                   
1     A               0.0   0.0      1.0    1.0
      C               0.0   0.0      1.0    0.0
2     C               1.0   0.0      1.0    1.0
3     A               0.0   2.0      1.0    0.0
      C               0.0   0.0      1.0    0.0

# desired output: if (1), (2) and (3) take place somehow...
products      clothes  shoes  accessories  bags
store branch                                   
1     B             0      0            0     0  #store 1 has 3 records in total (A: clothes, shoes; C: clothes), so each count of 1 becomes 33.3%
      A          33.3   33.3            0     0
      C          33.3    0.0            0     0
2     B             0      0            0     0
      A             0      0            0     0
      C          33.3   33.3         33.3     0
3     B             0      0            0     0  #store 3 has 4 records in total (A: 2 bags, 1 clothes; C: 1 clothes), so the 2 bags become 50% and each single record 25%
      A            25      0            0    50
      C            25      0            0     0
# (3) columns rearranged so that "clothes" and "shoes" come first
# (3)+(2) branch B has been added and the order of branches changed to B, A, C
# (1) percentages of the occurrences are computed per store, as the comments above explain

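I tried handling each group separately, but (i) it does not take the replaced NaN values into account, and (ii) I should avoid handling each group individually, because I will need to concatenate many groups afterwards (this df is just an example) in order to plot the whole set of groups later.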

grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products      accessories  bags  clothes  shoes
store branch                                   
1     A               NaN   NaN     50.0  100.0  #why has it transformed on axis='columns'?
      C               NaN   NaN     50.0    0.0

I hope my question makes sense. Any insight into what I am trying to do would be greatly appreciated. Thanks a lot in advance!

Answer 1

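With the help of @Quang Hoang, who tried to help with this question the day before I posted this answer, I managed to find a solution.

To explain the last part of the calculation: I transformed each element by dividing it by the sum of the counts of its level-0 group (the store), rather than by a row, column, or overall total, to get each element's frequency.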
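To make the choice of denominator concrete, here is a minimal sketch that contrasts the column-wise division from the question with a per-store division. The names `counts`, `store_sizes`, `by_column`, `denominators`, and `by_store` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Counts per (store, branch) row and product column; absent combinations become 0.
counts = (
    df.groupby(['store', 'branch', 'products'])
      .size()
      .unstack('products', fill_value=0)
)

# Number of records per store: store 1 -> 3, store 2 -> 3, store 3 -> 4.
store_sizes = df.groupby('store').size()

# Column-wise division on the store-1 slice, as in the question's attempt:
# each column is divided by its own sum, so clothes is 1/2, shoes is 1/1,
# and the all-zero columns are 0/0, which pandas reports as NaN.
store1 = counts.loc[[1]]
by_column = store1 * 100 / store1.sum()

# Per-store division: look up each row's store size and divide row by row.
denominators = store_sizes.reindex(counts.index.get_level_values('store')).to_numpy()
by_store = counts.div(denominators, axis=0).mul(100).round(1)

print(by_column)
print(by_store)
```

Dividing by the store size is what turns the two bags of store 3, branch A into 50% (2 out of 4 records) instead of 100% of the bags column.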

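grouped_df = (
    df.groupby(['store', 'branch', 'products']).size()
    .unstack('branch')
    # add the missing branch 'B' filled with zeros and set the branch order
    .reindex(['B', 'C', 'A'], axis=1, fill_value=0)
    .stack('branch')
    .unstack('products')
    .replace({np.nan: 0})
    # divide each count by the number of records in its store
    .transform(lambda x: x * 100 / df.groupby(['store']).size())
    .round(1)
    # put 'clothes' and 'shoes' first
    .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
)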

Running the code above produces the desired output.

products      clothes  shoes  accessories  bags
store branch                                   
1     B           0.0    0.0          0.0   0.0
      C          33.3    0.0          0.0   0.0
      A          33.3   33.3          0.0   0.0
2     B           0.0    0.0          0.0   0.0
      C          33.3   33.3         33.3   0.0
3     B           0.0    0.0          0.0   0.0
      C          25.0    0.0          0.0   0.0
      A          25.0    0.0          0.0  50.0
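
Note that the output above lists the branches as B, C, A and has no row for store 2, branch A, whereas the question asked for the order B, A, C with every branch shown in every store. The following is a minimal sketch of one possible alternative, assuming it is acceptable to reindex the counts against a full grid of (store, branch) pairs. The names `branch_order`, `product_order`, `full_index`, `store_totals`, and `percent` are introduced here for illustration and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

# Orders requested in the question (assumed fixed and known in advance).
branch_order = ['B', 'A', 'C']
product_order = ['clothes', 'shoes', 'accessories', 'bags']

# Counts per (store, branch) row and product column; absent products become 0.
counts = (
    df.groupby(['store', 'branch', 'products'])
      .size()
      .unstack('products', fill_value=0)
)

# Every (store, branch) pair, with branches in the requested order.
full_index = pd.MultiIndex.from_product(
    [sorted(df['store'].unique()), branch_order],
    names=['store', 'branch'],
)

# Insert the missing rows as zeros and reorder rows and columns in one step.
counts = counts.reindex(index=full_index, columns=product_order, fill_value=0)

# Total number of records per store, repeated on every row of that store.
store_totals = counts.sum(axis=1).groupby(level='store').transform('sum')

# Percentage of each (branch, product) count within its store.
percent = counts.div(store_totals, axis=0).mul(100).round(1)
print(percent)
```

Because the whole frame is handled at once, no per-store slicing or concatenation is needed before plotting.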
