Split a pandas groupby() into smaller groups and combine them

Question · Votes: 0 · Answers: 3
                            city  temperature  windspeed   event
            day
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain
            2017-01-08  new york           23          7    Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85          9   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53          8   Sunny

The table above is the original data.

Below is the result of calling np.array_split(data, 4).

                            city  temperature  windspeed   event
            day
            2017-01-01  new york           32          6    Rain
            2017-01-02  new york           36          7   Sunny
            2017-01-03  new york           28         12    Snow
            2017-01-04  new york           33          7   Sunny
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-07  new york           27         12    Rain

                            city  temperature  windspeed   event
            day
            2017-01-08  new york           23          7    Rain
            2017-01-01    mumbai           90          5   Sunny
            2017-01-02    mumbai           85         12     Fog
            2017-01-03    mumbai           87         15     Fog
            2017-01-04    mumbai           92          5    Rain
            2017-01-05    mumbai           89          7   Sunny

                            city  temperature  windspeed   event
            day
            2017-01-06    mumbai           80         10     Fog
            2017-01-07    mumbai           85          9   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-01     paris           45         20   Sunny
            2017-01-02     paris           50         13  Cloudy
            2017-01-03     paris           54          8  Cloudy

                            city  temperature  windspeed   event
            day
            2017-01-04     paris           42         10  Cloudy
            2017-01-05     paris           43         20   Sunny
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain
            2017-01-08     paris           42         15  Cloudy
            2017-01-09     paris           53          8   Sunny

As you can see, I tried to create 4 groups from the original data, making sure each group contains all cities. However, np.array_split() divides the data into 4 groups that do not each contain all cities. I want every group to include mumbai, paris, and new york. How can I achieve this?

In other words, what I want to achieve is the following:

Group 1:

                            city  temperature  windspeed   event
            day
            2017-01-01  new york           32          6    Rain
            2017-01-02     paris           50         13  Cloudy
            2017-01-02    mumbai           85         12     Fog
            2017-01-05  new york           31          7    Rain
            2017-01-06  new york           33          5   Sunny
            2017-01-05    mumbai           89          7   Sunny
            2017-01-05     paris           43         20   Sunny

Group 2:

                            city  temperature  windspeed   event
            day
            2017-01-04  new york           33          7   Sunny
            2017-01-01    mumbai           90          5   Sunny
            2017-01-03     paris           54          8  Cloudy
            2017-01-07  new york           27         12    Rain
            2017-01-06    mumbai           80         10     Fog
            2017-01-09     paris           53          8   Sunny

Group 3:

                            city  temperature  windspeed   event
            day
            2017-01-02  new york           36          7   Sunny
            2017-01-03    mumbai           87         15     Fog
            2017-01-01     paris           45         20   Sunny
            2017-01-08    mumbai           89          8    Rain
            2017-01-06     paris           48          4  Cloudy
            2017-01-07     paris           40         14    Rain

Group 4:

                            city  temperature  windspeed   event
            day
            2017-01-03  new york           28         12    Snow
            2017-01-04    mumbai           92          5    Rain
            2017-01-07    mumbai           85          9   Sunny
            2017-01-04     paris           42         10  Cloudy
            2017-01-08     paris           42         15  Cloudy
            2017-01-08  new york           23          7    Rain

As the expected result shows, the key requirement is that every group contains every city.

My idea is to group the data by city, split each city's dataframe into 4 groups, and then combine the corresponding pieces across cities to get the 4 final groups.

python python-2.7 pandas grouping pandas-groupby
3 Answers
2 votes

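You can use GroupBy + cumcount to create a helper column that counts the occurrences of each city.

Then use dict + tuple with another GroupBy to create a dictionary of dataframes, each containing exactly one occurrence of each city.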

# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()

# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))

Result:

print(d)

{0:                city  temperature  windspeed  event  index
 day                                                      
 2017-01-01  newyork           32          6   Rain      0
 2017-01-01   mumbai           90          5  Sunny      0
 2017-01-01    paris           45         20  Sunny      0,
 1:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-02  newyork           36          7   Sunny      1
 2017-01-02   mumbai           85         12     Fog      1
 2017-01-02    paris           50         13  Cloudy      1,
 2:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-03  newyork           28         12    Snow      2
 2017-01-03   mumbai           87         15     Fog      2
 2017-01-03    paris           54          8  Cloudy      2,
 3:                city  temperature  windspeed   event  index
 day                                                       
 2017-01-04  newyork           33          7   Sunny      3
 2017-01-04   mumbai           92          5    Rain      3
 2017-01-04    paris           42         10  Cloudy      3}

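You can then extract the individual "groups" via d[0], d[1], d[2], and d[3]. In this particular case, you may prefer to key the dictionary by date instead, i.e.: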

d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}
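
The dictionary above has one entry per occurrence, so its size depends on how many rows each city has. If a fixed number of groups is wanted, as in the question, one possible variation is to fold the cumcount column with a modulo. The following is only a sketch under assumptions: n_groups and the group column are names introduced here for illustration, and only the day and city columns of the question's data are reproduced.

```python
import pandas as pd

# Stand-in for the question's data: 8 days for new york and mumbai, 9 for paris.
# Only the day and city columns are rebuilt; the weather columns are omitted.
days = pd.date_range("2017-01-01", periods=9).strftime("%Y-%m-%d")
counts = [("new york", 8), ("mumbai", 8), ("paris", 9)]
rows = [(day, city) for city, n in counts for day in days[:n]]
df = pd.DataFrame(rows, columns=["day", "city"]).set_index("day")

n_groups = 4  # hypothetical parameter, not part of the original answer

# Round-robin assignment: the k-th row of each city goes to group k % n_groups.
df["group"] = df.groupby("city").cumcount() % n_groups

# Dictionary of dataframes keyed by group number.
groups = {key: part for key, part in df.groupby("group")}

for key, part in groups.items():
    print(key, len(part), sorted(part["city"].unique()))
```

Every city lands in every group as long as each city has at least n_groups rows; with fewer rows, some groups would miss that city.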

0 votes

Here is my approach. First sort the dataframe by day and city:

df = df.sort_values(by=['day', 'city'])

Next, find an even split of your dataframe into 4 groups. If the split is not even, the last group gets the remainder:

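import numpy as np

n = len(df) // 4  # base size of each group
# cumulative boundaries; the last group absorbs the remainder
groups_n = np.cumsum([0, n, n, n, len(df) - 3 * n])
print(groups_n)
OUT >> array([ 0,  6, 12, 18, 25], dtype=int32)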

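groups_n holds the start and end positions of each group. So for Group 1 I would take df.iloc[0:6], and for Group 4 I would take df.iloc[18:25].

The final dictionary d holding the 4-group split of the dataframe is therefore: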

d = {}
for i in range(4):
    d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]

Example output: Group 1 (d[1]):

            city      temperature  windspeed    event
day             
2017-01-01  mumbai    90           5            Sunny
2017-01-01  new york  32           6            Rain
2017-01-01  paris     45           20           Sunny
2017-01-02  mumbai    85           12           Fog
2017-01-02  new york  36           7            Sunny
2017-01-02  paris     50           13           Cloudy

Group 4 (d[4]):

            city       temperature  windspeed   event
day             
2017-01-07  mumbai     85           9           Sunny
2017-01-07  new york   27           12          Rain
2017-01-07  paris      40           14          Rain
2017-01-08  mumbai     89           8           Rain
2017-01-08  new york   23           7           Rain
2017-01-08  paris      42           15          Cloudy
2017-01-09  paris      53           8           Sunny
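
Because this approach slices the sorted frame into contiguous blocks, whether each block contains every city depends on how the rows are distributed across days. The sketch below generalizes the boundary computation to k groups and checks the coverage explicitly. It makes some assumptions: k, bounds, and all_cities are names introduced here, day is kept as an ordinary column rather than the index, and only the day and city columns of the question's data are reproduced.

```python
import numpy as np
import pandas as pd

# Stand-in for the question's data: 8 days for new york and mumbai, 9 for paris.
days = pd.date_range("2017-01-01", periods=9).strftime("%Y-%m-%d")
counts = [("new york", 8), ("mumbai", 8), ("paris", 9)]
rows = [(day, city) for city, n in counts for day in days[:n]]
df = pd.DataFrame(rows, columns=["day", "city"])

# Sort so that each day's rows for all cities sit next to each other.
df = df.sort_values(by=["day", "city"]).reset_index(drop=True)

k = 4  # hypothetical parameter: number of groups
n = len(df) // k
# Group boundaries; the last group absorbs the remainder.
bounds = np.cumsum([0] + [n] * (k - 1) + [len(df) - (k - 1) * n])

d = {i + 1: df.iloc[bounds[i]:bounds[i + 1]] for i in range(k)}

# Check that every contiguous block really covers every city.
all_cities = set(df["city"])
for key, part in d.items():
    print(key, len(part), set(part["city"]) == all_cities)
```

With this data each block spans at least two full days, so the check passes for all four groups. For data where some cities have far fewer rows than others, a contiguous split may leave a city out of a block.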

0 votes

An extension of @gyx-hh's cumsum strategy solved this for me:

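# shuffle the rows, then number each city's rows 0, 1, 2, ... in shuffled order
df['samplegroup'] = df.sample(frac=1).groupby('city').cumcount()
# offset by the city's group number and wrap into 3 sample groups
df['samplegroup'] = ((df['samplegroup'] + df.groupby('city').ngroup()) % 3)
grouped = df.groupby(['samplegroup'])
list(grouped)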
[(0,
             day      city  temperature  windspeed   event  samplegroup
  5   2017-01-06  new york           33          5   Sunny            0
  7   2017-01-08  new york           23          7    Rain            0
  9   2017-01-02    mumbai           85         12     Fog            0
  12  2017-01-05    mumbai           89          7   Sunny            0
  15  2017-01-08    mumbai           89          8    Rain            0
  18  2017-01-03     paris           54          8  Cloudy            0
  20  2017-01-05     paris           43         20   Sunny            0
  22  2017-01-07     paris           40         14    Rain            0),
 (1,
             day      city  temperature  windspeed   event  samplegroup
  0   2017-01-01  new york           32          6    Rain            1
  3   2017-01-04  new york           33          7   Sunny            1
  4   2017-01-05  new york           31          7    Rain            1
  8   2017-01-01    mumbai           90          5   Sunny            1
  10  2017-01-03    mumbai           87         15     Fog            1
  13  2017-01-06    mumbai           80         10     Fog            1
  16  2017-01-01     paris           45         20   Sunny            1
  17  2017-01-02     paris           50         13  Cloudy            1
  21  2017-01-06     paris           48          4  Cloudy            1),
 (2,
             day      city  temperature  windspeed   event  samplegroup
  1   2017-01-02  new york           36          7   Sunny            2
  2   2017-01-03  new york           28         12    Snow            2
  6   2017-01-07  new york           27         12    Rain            2
  11  2017-01-04    mumbai           92          5    Rain            2
  14  2017-01-07    mumbai           85          9   Sunny            2
  19  2017-01-04     paris           42         10  Cloudy            2
  23  2017-01-08     paris           42         15  Cloudy            2
  24  2017-01-09     paris           53          8   Sunny            2)]
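
The output above uses 3 sample groups and an unseeded shuffle, so the row assignment changes on every run. A possible variant with 4 groups, as asked in the question, and a fixed seed is sketched below. The names n_groups, seed, rank, and offset are introduced here for illustration, the frame uses the default integer index as in the output above, and only the day and city columns of the question's data are reproduced.

```python
import pandas as pd

# Stand-in for the question's data: 8 days for new york and mumbai, 9 for paris.
days = pd.date_range("2017-01-01", periods=9).strftime("%Y-%m-%d")
counts = [("new york", 8), ("mumbai", 8), ("paris", 9)]
rows = [(day, city) for city, n in counts for day in days[:n]]
df = pd.DataFrame(rows, columns=["day", "city"])

n_groups = 4  # hypothetical parameter: number of sample groups
seed = 0      # hypothetical parameter: makes the shuffle repeatable

# Position of each row within its city after shuffling (indexed by row label).
rank = df.sample(frac=1, random_state=seed).groupby("city").cumcount()
# Integer id of each city, used to stagger where each city starts counting.
offset = df.groupby("city").ngroup()

# Both series align on the row labels, so the sum is well defined per row.
df["samplegroup"] = ((rank + offset) % n_groups).astype(int)

for key, part in df.groupby("samplegroup"):
    print(key, len(part), sorted(part["city"].unique()))
```

Which rows end up in which group depends on the seed, but the group sizes do not: each city contributes its rows to the groups in rotation, so every group receives every city as long as each city has at least n_groups rows.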