Finding uniqueness of a column grouped by another column

Question

I am trying to reverse engineer what the following function does in a codebase I am working with:

def _helper(df):
  return (df.groupby(['a', 'b', 'c'])
          .size()
          .reset_index()
          .rename(columns={0: 'count'}))

def nunique_counts(df, col):
  df   = _helper(df)
  data = df[['a', col]].groupby('a').nunique()[col]
  return data

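I had assumed that nunique_counts returns, for each unique value of a, the number of unique values in column col. I'm not sure that is actually the case, though, because I think we could then simply do this: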

def nunique_counts(df, col):
  return df.groupby('a')[col].nunique()

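What is the original function trying to do? Also, the count column created in _helper does not seem to be used, and the groupby over the additional columns should not affect the final result, so I'm not sure why it is even done, although I'm not 100% confident about this.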

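Empirically, on the problem I am working on, the two do not seem to produce the same results. Specifically, the second one produces counts that are <= those of the former one.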

In response to @Quang's comment:

In [2]: def _helper(df): 
   ...:   return (df.groupby(['a', 'b', 'c']) 
   ...:           .size() 
   ...:           .reset_index() 
   ...:           .rename(columns={0: 'count'})) 
   ...:                                                                             

In [3]: import pandas as pd                                                         

In [4]: df = pd.DataFrame({"a": [0, 1, 2, 3], "b": [0,0,1,1], "c":[0,0,0,0]})       

In [5]: _helper(df)                                                                 
Out[5]: 
   a  b  c  count
0  0  0  0      1
1  1  0  0      1
2  2  1  0      1
3  3  1  0      1

In [6]: df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0,0,1,1], "c":[0,0,0,0]})       

In [7]: _helper(df)                                                                 
Out[7]: 
   a  b  c  count
0  0  0  0      2
1  1  1  0      2

It turns out that nan values in the columns are the culprit. Consider the following:

In [23]: df = pd.DataFrame({"a": [0, 0, 2, 3], "b": [0,0,1,1], "c": [np.nan,np.nan,0,0]})

In [24]: df                                                                         
Out[24]: 
   a  b    c
0  0  0  NaN
1  0  0  NaN
2  2  1  0.0
3  3  1  0.0

In [25]: nunique_counts(df, "b")                                                    
Out[25]: 
a
2    1
3    1
Name: b, dtype: int64

In [26]: nunique_counts1(df, "b")    # I defined nunique_counts1 to be the modified implementation                                                
Out[26]: 
a
0    1
2    1
3    1
Name: b, dtype: int64

Answer 1

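The _helper function aggregates the unique combinations of a/b/c and adds a new column holding the number of rows in each group (dropping any other columns in the process). It also drops every row that has NaN in a, b, or c, because the dropna parameter of groupby defaults to True.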
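The effect of the dropna default can be seen in isolation with a minimal sketch. The frame below is illustrative data mirroring the question's example, not data from the original codebase, and passing dropna=False assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the rows with a == 0 have NaN in "c".
df = pd.DataFrame({"a": [0, 0, 2, 3],
                   "b": [0, 0, 1, 1],
                   "c": [np.nan, np.nan, 0, 0]})

# Default (dropna=True): rows whose group key contains NaN are excluded.
dropped = df.groupby(["a", "b", "c"]).size()

# dropna=False keeps NaN as a valid group key, so no rows are lost.
kept = df.groupby(["a", "b", "c"], dropna=False).size()

print(len(dropped), dropped.sum())  # groups and rows that survive by default
print(len(kept), kept.sum())        # groups and rows with NaN keys retained
```

With the default, the a == 0 group disappears before nunique is ever computed, which is why the original function and the simplified one disagree on data containing NaN.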

For example:

df = pd.DataFrame({"a": [0,2,2,2,1], "b": [0,0,1,1,2], "c":[0,0,0,0,None]})

   a  b    c
0  0  0  0.0
1  2  0  0.0
2  2  1  0.0  # duplicated
3  2  1  0.0  #
4  1  2  NaN      # this will be dropped

_helper(df)

   a  b    c  count
0  0  0  0.0      1
1  2  0  0.0      1
2  2  1  0.0      2  # 2 duplicated rows

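So _helper acts more or less like dropna, and, as you suggested, you should be able to replace the original function with a single expression:

df.dropna(subset=['a', 'b', 'c']).groupby('a')[col].nunique()

The exception is when the value passed for col is col='count', which is of course the column created by _helper. In that case you need the original function.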
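As a quick sanity check, the two versions can be compared on the NaN example from the question. This is a sketch under stated assumptions: nunique_counts_dropna is a hypothetical name for the one-line replacement, and the comparison only covers this one small frame.

```python
import numpy as np
import pandas as pd

def _helper(df):
    # Same logic as in the question: group sizes per (a, b, c) combination.
    return (df.groupby(["a", "b", "c"])
              .size()
              .reset_index()
              .rename(columns={0: "count"}))

def nunique_counts(df, col):
    # Original implementation from the question.
    df = _helper(df)
    return df[["a", col]].groupby("a").nunique()[col]

def nunique_counts_dropna(df, col):
    # Hypothetical name for the suggested one-line replacement.
    return df.dropna(subset=["a", "b", "c"]).groupby("a")[col].nunique()

df = pd.DataFrame({"a": [0, 0, 2, 3],
                   "b": [0, 0, 1, 1],
                   "c": [np.nan, np.nan, 0, 0]})

original = nunique_counts(df, "b")
replacement = nunique_counts_dropna(df, "b")

# Compare as plain dicts so the check does not depend on dtype details.
print(original.to_dict() == replacement.to_dict())
```

Note that nunique_counts_dropna(df, "count") would raise a KeyError here, since the count column only exists after _helper has run.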
