Counting unique values in pandas when one value takes priority

Question

I have a simple DataFrame like this:

import pandas as pd
import numpy as np
df = pd.DataFrame({'CUS_NO': ['900636229', '900636229', '900636080', '900636080', '900636052', '900636052', 
                              '900636053', '900636054', '900636055', '900636056'], 
                   'indicator': ['both', 'left_only', 'both', 'left_only', 'both', 'left_only', 
                                 'both', 'left_only', 'both', 'left_only'],
                   'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN', 'VN', 
                                   'KR', 'VN', 'KR', 'VN']})

        CUS_NO      indicator   Nationality
0       900636229   both        VN
1       900636229   left_only   VN
2       900636080   both        KR
3       900636080   left_only   KR
4       900636052   both        VN
5       900636052   left_only   VN
6       900636053   both        KR
7       900636054   left_only   VN
8       900636055   both        KR
9       900636056   left_only   VN

I want to count the unique values of CUS_NO, so I used pd.Series.nunique in the following code:

df2 = pd.pivot_table(df, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True).reset_index()
df2

This is the result:

indicator   Nationality both    left_only   All
0           KR          3       1           3
1           VN          2       4           4
2           All         5       5           7

However, when the same CUS_NO appears with different indicator values, I want it counted only under the both indicator. This is my expected output:

indicator   Nationality both    left_only   All
0           KR          3       0           3
1           VN          2       2           4
2           All         5       2           7

Thank you.

Answer 1

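You can use sort_values to put "both" at the top, then drop_duplicates so that only the first row per customer is kept. A plain sort is enough here because "both" sorts alphabetically before "left_only"; if you have more categories, use Categorical to define a custom order.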

tmp = (df
   .sort_values(by='indicator')
   .drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
)

df2 = pd.pivot_table(tmp, values='CUS_NO', 
                     index='Nationality', 
                     columns='indicator', 
                     aggfunc=pd.Series.nunique, 
                     margins=True,
                     fill_value=0).reset_index()

Output:

indicator Nationality  both  left_only  All
0                  KR     3          0    3
1                  VN     2          2    4
2                 All     5          2    7

Intermediate tmp:

      CUS_NO  indicator Nationality
0  900636229       both          VN
2  900636080       both          KR
4  900636052       both          VN
6  900636053       both          KR
8  900636055       both          KR
7  900636054  left_only          VN
9  900636056  left_only          VN
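
The answer mentions Categorical for the case where alphabetical order does not match the desired priority. The following is a minimal sketch of that idea, not part of the original answer. The sample data, the third category 'right_only', and the priority order are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: 'right_only' is an assumed third category,
# it does not appear in the original question.
df = pd.DataFrame({
    'CUS_NO': ['A', 'A', 'B', 'B', 'C'],
    'indicator': ['left_only', 'both', 'right_only', 'left_only', 'right_only'],
    'Nationality': ['VN', 'VN', 'KR', 'KR', 'VN'],
})

# Assumed priority, deliberately not alphabetical:
# 'both' wins over 'right_only', which wins over 'left_only'.
priority = ['both', 'right_only', 'left_only']

tmp = (
    df
    # An ordered Categorical makes sort_values follow the priority list
    # instead of alphabetical order.
    .assign(indicator=pd.Categorical(df['indicator'],
                                     categories=priority, ordered=True))
    .sort_values(by='indicator')
    # Keep only the highest-priority row for each customer.
    .drop_duplicates(subset=['CUS_NO', 'Nationality'], keep='first')
    # Cast back to plain strings (a choice made here to keep later
    # steps working on ordinary string labels).
    .assign(indicator=lambda d: d['indicator'].astype(str))
)

print(tmp)
```

With this data, customer A keeps its 'both' row, while B keeps 'right_only' rather than 'left_only'. The resulting tmp can then be passed to the same pd.pivot_table call as above.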
