Python, pandas DataFrame: groupby on a column with values known in advance

Question

Consider this example:

>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         ['X', 'R', 1],
...         ['X', 'G', 2],
...         ['X', 'R', 1],
...         ['X', 'B', 3],
...         ['X', 'R', 2],
...         ['X', 'B', 2],
...         ['X', 'G', 1],
...     ],
...     columns=['client', 'status', 'cnt']
... )
>>> df
  client status  cnt
0      X      R    1
1      X      G    2
2      X      R    1
3      X      B    3
4      X      R    2
5      X      B    2
6      X      G    1
>>>
>>> df_gb = df.groupby(['client', 'status']).cnt.sum().unstack()
>>> df_gb
status  B  G  R
client
X       5  3  4
>>>
>>> def color(row):
...     if 'R' in row:
...         red = row['R']
...     else:
...         red = 0
...     if 'B' in row:
...         blue = row['B']
...     else:
...         blue = 0
...     if 'G' in row:
...         green = row['G']
...     else:
...         green = 0
...     if red > 0:
...         return 'red'
...     elif blue > 0 and (red + green) == 0:
...         return 'blue'
...     elif green > 0 and (red + blue) == 0:
...         return 'green'
...     else:
...         return 'orange'
...
>>> df_gb.apply(color, axis=1)
client
X    red
dtype: object
>>>

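This code uses groupby to get the count for each category (red, green, blue), and apply implements the logic that determines the color for each client (only one in this example).

The problem is that the groupby result can contain any combination of the R, G and B values: for example, I may have R and G columns but no B column, only an R column, or no RGB columns at all.

Because of this, inside the applied function I had to add an if statement for each column, so that every color gets a count whether or not its value is present in the groupby result.

Is there another way to enforce the logic of the color function, using something other than this (ugly) apply approach?

For example, in this case I know in advance that I need counts for exactly three categories, R, G and B, so I need something like grouping by the column and by these three values.

Can I group the DataFrame by these three categories (with a Series, dict, function?) so that I always get either zero or the sum for each of them, whether or not the category is present in a group?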

Answer 1

Use:

#changed data for more combinations

df = pd.DataFrame(
    [
        ['W', 'R', 1],
        ['X', 'G', 2],
        ['Y', 'R', 1],
        ['Y', 'B', 3],
        ['Z', 'R', 2],
        ['Z', 'B', 2],
        ['Z', 'G', 1],
     ],
     columns=['client', 'status', 'cnt']
)
print (df)
  client status  cnt
0      W      R    1
1      X      G    2
2      Y      R    1
3      Y      B    3
4      Z      R    2
5      Z      B    2
6      Z      G    1

Then add the fill_value=0 parameter to replace non-matched (missing) values with 0:

df_gb = df.groupby(['client', 'status']).cnt.sum().unstack(fill_value=0)
#alternative
df_gb = df.pivot_table(index='client', 
                       columns='status', 
                       values='cnt', 
                       aggfunc='sum', 
                       fill_value=0)
print (df_gb)
status  B  G  R
client         
W       0  0  1
X       0  2  0
Y       3  0  1
Z       2  1  2
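
Note that fill_value=0 only fills gaps for statuses that occur at least once somewhere in the data. If no client has, say, status B, there is no B column at all. A minimal sketch of one way to guarantee the three columns, assuming the categories known in advance are R, G and B (the name expected and the sample data below are illustrative, not from the question), is to reindex the columns:

```python
import pandas as pd

# Illustrative data in which status 'B' never occurs
df = pd.DataFrame(
    [
        ['W', 'R', 1],
        ['X', 'G', 2],
        ['X', 'R', 4],
    ],
    columns=['client', 'status', 'cnt']
)

# Categories known in advance (assumed name)
expected = ['R', 'G', 'B']

df_gb = (
    df.groupby(['client', 'status'])['cnt']
      .sum()
      .unstack(fill_value=0)
      # add any expected column that is missing, filled with 0
      .reindex(columns=expected, fill_value=0)
)
print(df_gb)
```

With the columns guaranteed, the per-column if checks in the color function are no longer needed.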

Instead of the function, create a helper DataFrame with all combinations of 0 and 1, and add a new column output to it:

from itertools import product

df1 = pd.DataFrame(product([0,1], repeat=3), columns=['R','G','B'])
#change the labels as needed
df1['output'] = ['no','blue','green','color2','red','red1','red2','all']
print (df1)
   R  G  B  output
0  0  0  0      no
1  0  0  1    blue
2  0  1  0   green
3  0  1  1  color2
4  1  0  0     red
5  1  0  1    red1
6  1  1  0    red2
7  1  1  1     all
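
The labels in the helper table are placeholders. As a sketch, the table could be filled so that it reproduces the rules of the color function from the question, assuming the counts are never negative so that each count can be reduced to a 0/1 flag. The names lookup and rule are hypothetical:

```python
from itertools import product

import pandas as pd

# All 0/1 combinations, in the column order (R, G, B)
lookup = pd.DataFrame(list(product([0, 1], repeat=3)), columns=['R', 'G', 'B'])

def rule(r, g, b):
    """Decision logic of the question's color(), applied to 0/1 flags."""
    if r:
        return 'red'
    if b and not g:
        return 'blue'
    if g and not b:
        return 'green'
    return 'orange'

lookup['output'] = [
    rule(r, g, b)
    for r, g, b in zip(lookup['R'], lookup['G'], lookup['B'])
]
print(lookup)
```

Writing the rules once as a table keeps the per-row work out of apply; the table has only eight rows regardless of how many clients there are.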

Then, to replace values above 1 with 1, use DataFrame.clip:

print (df_gb.clip(upper=1))
status  B  G  R
client         
W       0  0  1
X       0  1  0
Y       1  0  1
Z       1  1  1

Finally, use DataFrame.merge to get the new output column. No on parameter is passed, so the join uses the intersection of the columns in both DataFrames, here R, G, B:

df2 = df_gb.clip(upper=1).merge(df1)
print (df2)
   B  G  R output
0  0  0  1    red
1  0  1  0  green
2  1  0  1   red1
3  1  1  1    all
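
The merge above discards the client index, so the rows come back numbered 0 to 3. If the client labels are needed, one possible variant (a sketch, not part of the original answer; the names flags and result are illustrative) is to move the index into a column before merging and restore it afterwards:

```python
from itertools import product

import pandas as pd

df = pd.DataFrame(
    [
        ['W', 'R', 1], ['X', 'G', 2], ['Y', 'R', 1], ['Y', 'B', 3],
        ['Z', 'R', 2], ['Z', 'B', 2], ['Z', 'G', 1],
    ],
    columns=['client', 'status', 'cnt'],
)
df_gb = df.groupby(['client', 'status'])['cnt'].sum().unstack(fill_value=0)

# Helper table with one label per 0/1 combination, as in the answer
df1 = pd.DataFrame(list(product([0, 1], repeat=3)), columns=['R', 'G', 'B'])
df1['output'] = ['no', 'blue', 'green', 'color2', 'red', 'red1', 'red2', 'all']

# Turn the client index into a column so it survives the merge
flags = df_gb.clip(upper=1).reset_index()

# A left join keeps one row per client, in the original order
result = flags.merge(df1, on=['R', 'G', 'B'], how='left').set_index('client')
print(result['output'])
```

Passing on explicitly also prevents the join from accidentally using the client column if the helper table ever gains a column of the same name.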
