Creating a sequence pattern per group with groupby in pandas

Question (votes: 0, answers: 1)

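I have a dataset with three columns: `id`, `sort_seq`, and `level`. For each id, I want to identify the sequence of levels, ordered by `sort_seq`. Please suggest efficient code other than a `for` loop, because building a dictionary per id and appending it to a list takes too long.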

Input dataset

import pandas as pd
import numpy as np
data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': [89, 24, 56,  8,  5, 64, 93, 88, 61, 31, 50, 75,  1, 81],
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

Expected output

Code I tried

collect = []
for ij in df.id.unique():
  idict = {}
  x =  df[df['id'] == ij]
  x = x.sort_values(by='sort_seq',ascending=True)
  x = x.reset_index()
  idict[ij] =  x['level'].tolist()
  collect.append(idict)
collect
Tags: pandas dataframe dictionary design-patterns sequence
1 Answer
0 votes

Use:

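import numpy as np
import pandas as pd

np.random.seed(123)

data = {'id': [1, 1, 1, 1,2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': np.random.randint(0, 100, size=14),
        'level':['a', 'a',  'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c','b']}
df = pd.DataFrame(data)

# Order rows by id, then by sort_seq within each id
df1 = df.sort_values(['id','sort_seq'])

# A new run starts wherever level differs from the previous row;
# cumsum() turns those change flags into run numbers used as a second group key.
# Note: x['count'] relies on value_counts() naming its result 'count' (recent pandas versions).
df1 = (df1.groupby(['id', df1['level'].ne(df1['level'].shift()).cumsum()])['level'].value_counts()
           .droplevel(1)                                                # drop the run-number index level
           .reset_index()                                               # columns: id, level, count
           .assign(level=lambda x: x['count'].astype(str) + x['level']) # e.g. 2 and 'a' become '2a'
           .groupby('id')['level'].agg(','.join)                        # join the runs of each id
           .reset_index(name='PATTERN')
           )
print(df1)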
   id      PATTERN
0   1     1c,2a,1b
1   2           2x
2   3  1b,1g,1b,1a
3   4           1b
4   5           2c
5   6           1b
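
The answer above produces a run-length pattern. If only the plain ordered list of levels per id is wanted, as in the loop from the question, one possible loop-free sketch is below. It assumes the question's original `sort_seq` values, and the names `sequences` and `result` are illustrative, not from the original post.

```python
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5, 6],
        'sort_seq': [89, 24, 56, 8, 5, 64, 93, 88, 61, 31, 50, 75, 1, 81],
        'level': ['a', 'a', 'b', 'c', 'x', 'x', 'g', 'a', 'b', 'b', 'b', 'c', 'c', 'b']}
df = pd.DataFrame(data)

# Sort once by id and sort_seq; groupby keeps that row order inside each group,
# so collecting 'level' into a list yields each id's sequence in sort_seq order.
sequences = (
    df.sort_values(['id', 'sort_seq'])
      .groupby('id')['level']
      .agg(list)
)

# One dictionary mapping id -> ordered list of levels
# (the loop in the question builds a list of single-key dictionaries instead).
result = sequences.to_dict()
print(result)
```

This sorts the whole frame once instead of filtering and sorting it for every id, which is usually where the loop version spends its time.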