# Get the rows with the maximum value in each group using groupby

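How do I find all rows in a pandas DataFrame that have the maximum value of the ``count`` column after grouping by ``['Sp','Mt']``?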

``````   Sp   Mt Value   count
0  MM1  S1   a     **3**
1  MM1  S1   n       2
2  MM1  S3   cb    **5**
3  MM2  S3   mk    **8**
4  MM2  S4   bg    **10**
5  MM2  S4   dgd     1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2   uyi   **7**
``````

``````   Sp   Mt   Value  count
0  MM1  S1   a      **3**
2  MM1  S3   cb     **5**
3  MM2  S3   mk     **8**
4  MM2  S4   bg     **10**
8  MM4  S2   uyi    **7**
``````

``````   Sp   Mt   Value  count
4  MM2  S4   bg     10
5  MM2  S4   dgd    1
6  MM4  S2   rd     2
7  MM4  S2   cb     8
8  MM4  S2   uyi    8
``````

``````   Sp   Mt   Value  count
4  MM2  S4   bg     10
7  MM4  S2   cb     8
8  MM4  S2   uyi    8
``````
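
The second example above contains a tie (two rows with ``count`` 8 in the same group), so a solution has to keep every tied row. A minimal sketch of one way to do that, rebuilding the second example table by hand and broadcasting each group's maximum back onto its rows; the variable names ``group_max`` and ``result`` are illustrative only:

```python
import pandas as pd

# Data copied from the second example table; the index labels 4..8 are kept
# so the output can be compared with the expected rows.
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# transform("max") returns a Series aligned with df that holds, for every
# row, the maximum count of that row's (Sp, Mt) group.
group_max = df.groupby(["Sp", "Mt"])["count"].transform("max")

# Keep the rows whose own count equals the group maximum (ties included).
result = df[df["count"] == group_max]
print(result)
```

Both rows 7 and 8 survive because the comparison is applied row by row rather than picking a single position per group.
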
``````In [1]: df
Out[1]:
Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

In [2]: df.groupby(['Sp', 'Mt'])['count'].max()
Out[2]:
Sp   Mt
MM1  S1     3
S3     5
MM2  S3     8
S4    10
MM4  S2     7
Name: count, dtype: int64
``````

``````In [3]: idx = df.groupby(['Sp', 'Mt'])['count'].transform('max') == df['count']

In [4]: df[idx]
Out[4]:
Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7
``````

``````In [5]: df['count_max'] = df.groupby(['Sp', 'Mt'])['count'].transform('max')

In [6]: df
Out[6]:
Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          5
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7
``````

``````df.sort_values('count', ascending=False).drop_duplicates(['Sp','Mt'])
``````

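Another option is ``idxmax()``, which returns the index label of the row holding the maximum in each group; passing those labels to ``.loc`` filters the rows: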

``````In [367]: df
Out[367]:
sp  mt  val  count
0  MM1  S1    a      3
1  MM1  S1    n      2
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
5  MM2  S4  dgb      1
6  MM4  S2   rd      2
7  MM4  S2   cb      2
8  MM4  S2  uyi      7

# Apply idxmax() and use .loc() on dataframe to filter the rows with max values:
In [368]: df.loc[df.groupby(["sp", "mt"])["count"].idxmax()]
Out[368]:
sp  mt  val  count
0  MM1  S1    a      3
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
8  MM4  S2  uyi      7

# Just to show what values are returned by .idxmax() above:
In [369]: df.groupby(["sp", "mt"])["count"].idxmax().values
Out[369]: array([0, 2, 3, 4, 8])
``````
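
A caveat that may be worth checking on your own data: ``idxmax`` returns the label of the first occurrence of the maximum, so it yields exactly one row per group even when several rows tie. A minimal sketch using the tie data from the question's second example (the variable names are illustrative):

```python
import pandas as pd

# Tie data: rows 7 and 8 share the maximum count in group (MM4, S2).
df = pd.DataFrame(
    {
        "Sp": ["MM2", "MM2", "MM4", "MM4", "MM4"],
        "Mt": ["S4", "S4", "S2", "S2", "S2"],
        "Value": ["bg", "dgd", "rd", "cb", "uyi"],
        "count": [10, 1, 2, 8, 8],
    },
    index=[4, 5, 6, 7, 8],
)

# One label per group: the first row that reaches the group maximum.
labels = df.groupby(["Sp", "Mt"])["count"].idxmax()
one_per_group = df.loc[labels]
print(one_per_group)
```

If tied rows must all be returned, a comparison against the broadcast group maximum (``transform``) is the safer choice; if exactly one row per group is wanted, ``idxmax`` gives that directly.
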

You may not need ``groupby()`` at all; ``sort_values`` + ``drop_duplicates`` can be used together instead:

``````df.sort_values('count').drop_duplicates(['Sp', 'Mt'], keep='last')
Out[190]:
Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
``````

Almost the same logic works with ``tail``:

``````df.sort_values('count').groupby(['Sp', 'Mt']).tail(1)
Out[52]:
Sp  Mt Value  count
0  MM1  S1     a      3
2  MM1  S3    cb      5
8  MM4  S2   uyi      7
3  MM2  S3    mk      8
4  MM2  S4    bg     10
``````

``````df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df_grouped = df.groupby(['sp', 'mt']).agg({'count':'max'})

df_grouped = df_grouped.reset_index()

df_grouped = df_grouped.rename(columns={'count':'count_max'})

df = pd.merge(df, df_grouped, how='left', on=['sp', 'mt'])

df = df[df['count'] == df['count_max']]
``````

Using ``groupby`` and ``idxmax``:

1. Convert ``date`` to ``datetime``:

``````df['date'] = pd.to_datetime(df['date'])
``````
2. Get the index of the ``max`` ``date`` after ``groupby`` on ``ad_id``:

``````idx = df.groupby(by='ad_id')['date'].idxmax()
``````
3. Get the wanted data:

``````df_max = df.loc[idx,]
``````
``````   ad_id  price       date
7     22      2 2018-06-11
6     23      2 2018-06-22
2     24      2 2018-06-30
3     28      5 2018-06-22
``````
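
The three steps above can be run end to end as follows. The input frame is not shown in the source, so the data below is hypothetical, chosen only so that the result matches the output table; the column names ``ad_id``, ``price`` and ``date`` come from that table.

```python
import pandas as pd

# Hypothetical input data (an assumption, not from the original answer).
df = pd.DataFrame({
    "ad_id": [22, 23, 24, 28, 24, 22, 23, 22],
    "price": [1, 3, 2, 5, 4, 6, 2, 2],
    "date": ["2018-06-01", "2018-06-05", "2018-06-30", "2018-06-22",
             "2018-06-10", "2018-06-03", "2018-06-22", "2018-06-11"],
})

# Step 1: parse the strings so that "max" means "latest date".
df["date"] = pd.to_datetime(df["date"])

# Step 2: index label of the latest date within each ad_id.
idx = df.groupby("ad_id")["date"].idxmax()

# Step 3: select those rows.
df_max = df.loc[idx]
print(df_max)
```
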

``````df[df['count'] == df.groupby(['Mt'])['count'].transform('max')]
``````

``````import pandas as pd
import numpy as np
import time

df = pd.DataFrame(np.random.randint(1,10,size=(1000000, 2)), columns=list('AB'))

start_time = time.time()
df1idx = df.groupby(['A'])['B'].transform('max') == df['B']
df1 = df[df1idx]
print("---1 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df2 = df.sort_values('B').groupby(['A']).tail(1)
print("---2 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3 = df.sort_values('B').drop_duplicates(['A'],keep='last')
print("---3 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df3b = df.sort_values('B', ascending=False).drop_duplicates(['A'])
print("---3b) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
df4 = df[df['B'] == df.groupby(['A'])['B'].transform('max')]
print("---4 ) %s seconds ---" % (time.time() - start_time))

start_time = time.time()
d = df.groupby('A')['B'].nlargest(1)
df5 = df.iloc[[i[1] for i in d.index], :]
print("---5 ) %s seconds ---" % (time.time() - start_time))
``````

• ---1 ) 0.03337574005126953 seconds ---
• ---2 ) 0.1346898078918457 seconds ---
• ---3 ) 0.10243558883666992 seconds ---
• ---3b) 0.1004343032836914 seconds ---
• ---4 ) 0.028397560119628906 seconds ---
• ---5 ) 0.07552886009216309 seconds ---
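
Single ``time.time()`` measurements like the ones above can be noisy, so it may be worth repeating each candidate a few times. A minimal sketch using the standard-library ``timeit`` module on a smaller frame; the helper names ``by_transform`` and ``by_sort_drop`` are illustrative, and only two of the approaches are included:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 10, size=(100_000, 2)), columns=list("AB"))


def by_transform(frame):
    # Keeps every row that ties for the group maximum.
    return frame[frame["B"] == frame.groupby("A")["B"].transform("max")]


def by_sort_drop(frame):
    # Keeps exactly one row per group.
    return frame.sort_values("B").drop_duplicates("A", keep="last")


candidates = {"transform": by_transform, "sort+drop_duplicates": by_sort_drop}
timings = {}
for name, func in candidates.items():
    # Best of three runs, one call per run.
    timings[name] = min(timeit.repeat(lambda: func(df), number=1, repeat=3))
    print(f"{name}: {timings[name]:.4f} s")
```

Note that the two candidates do not return the same rows on data with ties: the ``transform`` version keeps all tied maxima, while ``sort_values`` + ``drop_duplicates`` keeps one row per group, so the timings compare slightly different amounts of work.
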

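Try ``nlargest`` on the groupby object. Its advantage is that it returns the rows from which the "nlargest item(s)" were taken, so we can get their index.

Here ``n=1`` selects the maximum and ``keep='all'`` retains tied maxima.

The resulting index consists of tuples (e.g. ``('MM1', 'S1', 0)``), so the last element of each tuple is taken as the row label.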

``````df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

d = df.groupby(['Sp', 'Mt'])['count'].nlargest(1, keep='all')

df.loc[[i[-1] for i in d.index]]
``````
``````    Sp  Mt  Val  count
0  MM1  S1    a      3
2  MM1  S3   cb      5
3  MM2  S3   mk      8
4  MM2  S4   bg     10
8  MM4  S2  uyi      7
``````

``````In [85]: import pandas as pd

In [86]: df = pd.DataFrame({
...: 'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
...: 'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
...: 'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
...: 'count' : [3,2,5,8,10,1,2,2,7]
...: })

## Apply nlargest(1) to find the max val df, and nlargest(n) gives top n values for df:
In [87]: df.groupby(["sp", "mt"]).apply(lambda x: x.nlargest(1, "count")).reset_index(drop=True)
Out[87]:
count  mt   sp  val
0      3  S1  MM1    a
1      5  S3  MM1   cb
2      8  S3  MM2   mk
3     10  S4  MM2   bg
4      7  S2  MM4  uyi
``````

``````df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df.sort_values("count", ascending=False).groupby(["sp", "mt"]).first().reset_index()
``````

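1. Sort ascending, drop duplicates keeping the last (2.22 s)
2. Sort descending, drop duplicates keeping the first (2.32 s)
3. Transform max within the loc function (3.73 s)
4. Transform max storing IDX, then using loc select as a second step (3.84 s)
5. Groupby using tail (8.98 s)
6. idxmax with groupby, then using loc select as a second step (95.39 s)
7. idxmax with groupby within the loc select (95.74 s)
8. nlargest(1), then using iloc select as a second step (> 35000 s) - did not finish after running overnight
9. nlargest(1) within iloc select (> 35000 s) - did not finish after running overnight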

``````df = pd.DataFrame({
'sp' : ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4','MM4'],
'mt' : ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'val' : ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'count' : [3,2,5,8,10,1,2,2,7]
})

df.groupby(['sp', 'mt']).apply(lambda grp: grp.nlargest(1, 'count'))
``````

``````df = pd.DataFrame({
'Sp': ['MM1', 'MM1', 'MM1', 'MM2', 'MM2', 'MM2', 'MM4', 'MM4', 'MM4'],
'Mt': ['S1', 'S1', 'S3', 'S3', 'S4', 'S4', 'S2', 'S2', 'S2'],
'Val': ['a', 'n', 'cb', 'mk', 'bg', 'dgb', 'rd', 'cb', 'uyi'],
'Count': [3, 2, 5, 8, 10, 1, 2, 2, 7]
})

(df.groupby(['Sp', 'Mt'])
.apply(lambda group: group[group['Count'] == group['Count'].max()])
.reset_index(drop=True))

Sp  Mt  Val  Count
0  MM1  S1    a      3
1  MM1  S3   cb      5
2  MM2  S3   mk      8
3  MM2  S4   bg     10
4  MM4  S2  uyi      7
``````

``.reset_index(drop=True)`` drops the group index and restores a plain integer index.

``df.loc[df.groupby('mt')['count'].idxmax()]`` selects by index label, so the index of ``df`` needs to be unique; if it is not, call ``df.reset_index(inplace=True)`` first.
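
A small sketch of why the index matters for the ``idxmax`` approach: ``idxmax`` returns index labels and ``.loc`` selects by label, so a duplicated label pulls in every row that carries it. The tiny frame below is hypothetical, and ``reset_index(drop=True)`` is used here instead of ``inplace=True`` so that the old labels are discarded rather than kept as a column.

```python
import pandas as pd

df = pd.DataFrame(
    {"mt": ["S1", "S1", "S3"], "count": [3, 2, 5]},
    index=[0, 0, 1],  # label 0 is duplicated on purpose
)

# idxmax yields labels 0 (for S1) and 1 (for S3); .loc then returns
# BOTH rows labelled 0, including the non-maximal one.
picked = df.loc[df.groupby("mt")["count"].idxmax()]
print(picked)

# With a fresh unique index the same expression returns one row per group.
fixed = df.reset_index(drop=True)
picked_fixed = fixed.loc[fixed.groupby("mt")["count"].idxmax()]
print(picked_fixed)
```
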