city temperature windspeed event
day
2017-01-01 new york 32 6 Rain
2017-01-02 new york 36 7 Sunny
2017-01-03 new york 28 12 Snow
2017-01-04 new york 33 7 Sunny
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-07 new york 27 12 Rain
2017-01-08 new york 23 7 Rain
2017-01-01 mumbai 90 5 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-03 mumbai 87 15 Fog
2017-01-04 mumbai 92 5 Rain
2017-01-05 mumbai 89 7 Sunny
2017-01-06 mumbai 80 10 Fog
2017-01-07 mumbai 85 9 Sunny
2017-01-08 mumbai 89 8 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 paris 50 13 Cloudy
2017-01-03 paris 54 8 Cloudy
2017-01-04 paris 42 10 Cloudy
2017-01-05 paris 43 20 Sunny
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
The table above is the original data.
The result of np.array_split(data, 4) is shown below.
day city temperature windspeed event
2017-01-01 new york 32 6 Rain
2017-01-02 new york 36 7 Sunny
2017-01-03 new york 28 12 Snow
2017-01-04 new york 33 7 Sunny
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-07 new york 27 12 Rain
day city temperature windspeed event
2017-01-08 new york 23 7 Rain
2017-01-01 mumbai 90 5 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-03 mumbai 87 15 Fog
2017-01-04 mumbai 92 5 Rain
2017-01-05 mumbai 89 7 Sunny
day city temperature windspeed event
2017-01-06 mumbai 80 10 Fog
2017-01-07 mumbai 85 9 Sunny
2017-01-08 mumbai 89 8 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 paris 50 13 Cloudy
2017-01-03 paris 54 8 Cloudy
day city temperature windspeed event
2017-01-04 paris 42 10 Cloudy
2017-01-05 paris 43 20 Sunny
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
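As you can see, I am trying to create 4 groups from the original data so that every group contains all cities. However, np.array_split() splits the data into 4 groups that do not each contain all cities. I want every group to include mumbai, paris, and new york. How can I achieve this?
In other words, what I want is something like the following:
Group 1: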
day city temperature windspeed event
2017-01-01 new york 32 6 Rain
2017-01-02 paris 50 13 Cloudy
2017-01-02 mumbai 85 12 Fog
2017-01-05 new york 31 7 Rain
2017-01-06 new york 33 5 Sunny
2017-01-05 mumbai 89 7 Sunny
2017-01-05 paris 43 20 Sunny
Group 2:
day city temperature windspeed event
2017-01-04 new york 33 7 Sunny
2017-01-01 mumbai 90 5 Sunny
2017-01-03 paris 54 8 Cloudy
2017-01-07 new york 27 12 Rain
2017-01-06 mumbai 80 10 Fog
2017-01-09 paris 53 8 Sunny
Group 3:
day city temperature windspeed event
2017-01-02 new york 36 7 Sunny
2017-01-03 mumbai 87 15 Fog
2017-01-01 paris 45 20 Sunny
2017-01-08 mumbai 89 8 Rain
2017-01-06 paris 48 4 Cloudy
2017-01-07 paris 40 14 Rain
Group 4:
day city temperature windspeed event
2017-01-03 new york 28 12 Snow
2017-01-04 mumbai 92 5 Rain
2017-01-07 mumbai 85 9 Sunny
2017-01-04 paris 42 10 Cloudy
2017-01-08 paris 42 15 Cloudy
2017-01-08 new york 23 7 Rain
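As the expected result shows, the key requirement is that every group contains every subject (city).
My idea is to group the data by city, split each city's dataframe into 4 parts, and then combine the corresponding parts across cities to get the 4 final groups.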
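The idea described above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name split_covering_all_cities and the toy dataframe are made up for the example, and the split is done on row positions so that np.array_split is never applied to the dataframe itself.

```python
import numpy as np
import pandas as pd


def split_covering_all_cities(df, n_groups=4, key="city"):
    """Split df into n_groups frames, spreading each city's rows over all groups."""
    buckets = [[] for _ in range(n_groups)]
    for _, city_rows in df.groupby(key, sort=False):
        # Split this city's row positions into n_groups nearly equal chunks.
        positions = np.arange(len(city_rows))
        for i, chunk in enumerate(np.array_split(positions, n_groups)):
            buckets[i].append(city_rows.iloc[chunk])
    # Combine chunk i of every city into final group i.
    return [pd.concat(parts) for parts in buckets]


# Toy stand-in for the weather data: 8 rows each for two cities, 9 for paris.
toy = pd.DataFrame({
    "city": ["new york"] * 8 + ["mumbai"] * 8 + ["paris"] * 9,
    "temperature": range(25),
})
groups = split_covering_all_cities(toy)
for g in groups:
    print(len(g), sorted(g["city"].unique()))
```

Each city needs at least n_groups rows for every group to contain it; with fewer rows, some chunks for that city are empty.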
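You can use GroupBy + cumcount to create a helper column that counts the occurrences of each city. Then use dict + tuple with another GroupBy to create a dictionary of dataframes, each containing exactly one occurrence of every city.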
# add index column giving count of city occurrence
df['index'] = df.groupby('city').cumcount()
# create dictionary of dataframes
d = dict(tuple(df.groupby('index')))
Result:
print(d)
{0: city temperature windspeed event index
day
2017-01-01 newyork 32 6 Rain 0
2017-01-01 mumbai 90 5 Sunny 0
2017-01-01 paris 45 20 Sunny 0,
1: city temperature windspeed event index
day
2017-01-02 newyork 36 7 Sunny 1
2017-01-02 mumbai 85 12 Fog 1
2017-01-02 paris 50 13 Cloudy 1,
2: city temperature windspeed event index
day
2017-01-03 newyork 28 12 Snow 2
2017-01-03 mumbai 87 15 Fog 2
2017-01-03 paris 54 8 Cloudy 2,
3: city temperature windspeed event index
day
2017-01-04 newyork 33 7 Sunny 3
2017-01-04 mumbai 92 5 Rain 3
2017-01-04 paris 42 10 Cloudy 3}
You can then extract the individual "groups" via d[0], d[1], d[2], d[3]. In this particular case, you may prefer to key the dictionary by date instead, i.e.
d = {df_.index[0]: df_ for _, df_ in df.groupby('index')}
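If exactly 4 groups are wanted rather than one group per occurrence, one possible variation (an assumption on top of this answer, not part of it) is to take the occurrence count modulo the number of groups. The names toy, n_groups, group_id, and d4 below are illustrative only.

```python
import pandas as pd

# Toy stand-in for the weather data: 8 rows each for two cities, 9 for paris.
toy = pd.DataFrame({
    "city": ["new york"] * 8 + ["mumbai"] * 8 + ["paris"] * 9,
    "temperature": range(25),
})

n_groups = 4
# The k-th occurrence of a city goes to group k % n_groups.
group_id = toy.groupby("city").cumcount() % n_groups
d4 = {key: part for key, part in toy.groupby(group_id)}

for key, part in d4.items():
    print(key, len(part), sorted(part["city"].unique()))
```

Because rows are dealt out round-robin within each city, every group contains every city as long as each city has at least n_groups rows.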
Here is my approach. First, sort the dataframe by day and city:
df = df.sort_values(by=['day', 'city'])
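Next, find an even split of your dataframe into 4 groups. If the split is not even, the last group gets the remainder:
import numpy as np

n = len(df) // 4  # base group size
# cumulative start/end positions; the last group absorbs the remainder
groups_n = np.cumsum([0, n, n, n, len(df) - 3 * n])
print(groups_n)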
OUT >> array([ 0, 6, 12, 18, 25], dtype=int32)
groups_n holds the start and end positions of each group. So Group 1 is df.iloc[0:6] and Group 4 is df.iloc[18:25].
The final dictionary d for the 4-group split of the dataframe is therefore:
d = {}
for i in range(4):
    d[i+1] = df.iloc[groups_n[i]:groups_n[i+1]]
Example output, Group 1 (d[1]):
city temperature windspeed event
day
2017-01-01 mumbai 90 5 Sunny
2017-01-01 new york 32 6 Rain
2017-01-01 paris 45 20 Sunny
2017-01-02 mumbai 85 12 Fog
2017-01-02 new york 36 7 Sunny
2017-01-02 paris 50 13 Cloudy
Group 4 (d[4]):
city temperature windspeed event
day
2017-01-07 mumbai 85 9 Sunny
2017-01-07 new york 27 12 Rain
2017-01-07 paris 40 14 Rain
2017-01-08 mumbai 89 8 Rain
2017-01-08 new york 23 7 Rain
2017-01-08 paris 42 15 Cloudy
2017-01-09 paris 53 8 Sunny
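Positional slicing like this only covers all cities when each slice spans at least one full day of data, which is why the sort matters. A small check can confirm that for a given split. The helper missing_cities and the toy data below are assumptions made for illustration, not part of the approach above.

```python
import pandas as pd

days = list(pd.date_range("2017-01-01", periods=9))
# Toy stand-in: 8 days for new york and mumbai, 9 days for paris.
toy = pd.DataFrame({
    "day": days[:8] + days[:8] + days,
    "city": ["new york"] * 8 + ["mumbai"] * 8 + ["paris"] * 9,
})


def missing_cities(groups, key="city"):
    """Map each group name to the sorted list of cities it lacks."""
    all_cities = set().union(*(set(g[key]) for g in groups.values()))
    return {name: sorted(all_cities - set(g[key])) for name, g in groups.items()}


bounds = [0, 6, 12, 18, 25]  # same boundaries as groups_n for 25 rows
sorted_toy = toy.sort_values(["day", "city"])
by_day = {i + 1: sorted_toy.iloc[bounds[i]:bounds[i + 1]] for i in range(4)}
unsorted = {i + 1: toy.iloc[bounds[i]:bounds[i + 1]] for i in range(4)}

print(missing_cities(by_day))    # sorted by day first: nothing missing
print(missing_cities(unsorted))  # original city-blocked order: cities missing
```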
An extension of @gyx-hh's cumsum strategy solved this for me:
df['samplegroup'] = df.sample(frac=1).groupby('city').cumcount()
df['samplegroup'] = ((df['samplegroup'] + df.groupby('city').ngroup()) % 3)
grouped = df.groupby(['samplegroup'])
list(grouped)
[(0,
day city temperature windspeed event samplegroup
5 2017-01-06 new york 33 5 Sunny 0
7 2017-01-08 new york 23 7 Rain 0
9 2017-01-02 mumbai 85 12 Fog 0
12 2017-01-05 mumbai 89 7 Sunny 0
15 2017-01-08 mumbai 89 8 Rain 0
18 2017-01-03 paris 54 8 Cloudy 0
20 2017-01-05 paris 43 20 Sunny 0
22 2017-01-07 paris 40 14 Rain 0),
(1,
day city temperature windspeed event samplegroup
0 2017-01-01 new york 32 6 Rain 1
3 2017-01-04 new york 33 7 Sunny 1
4 2017-01-05 new york 31 7 Rain 1
8 2017-01-01 mumbai 90 5 Sunny 1
10 2017-01-03 mumbai 87 15 Fog 1
13 2017-01-06 mumbai 80 10 Fog 1
16 2017-01-01 paris 45 20 Sunny 1
17 2017-01-02 paris 50 13 Cloudy 1
21 2017-01-06 paris 48 4 Cloudy 1),
(2,
day city temperature windspeed event samplegroup
1 2017-01-02 new york 36 7 Sunny 2
2 2017-01-03 new york 28 12 Snow 2
6 2017-01-07 new york 27 12 Rain 2
11 2017-01-04 mumbai 92 5 Rain 2
14 2017-01-07 mumbai 85 9 Sunny 2
19 2017-01-04 paris 42 10 Cloudy 2
23 2017-01-08 paris 42 15 Cloudy 2
24 2017-01-09 paris 53 8 Sunny 2)]
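Note that the snippet above takes the count modulo 3 and so yields three groups. A variant for the four groups asked about in the question might look like the sketch below. The n_groups variable, the random_state seed (added only to make the shuffle repeatable), and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the weather data: 8 rows each for two cities, 9 for paris.
toy = pd.DataFrame({
    "city": ["new york"] * 8 + ["mumbai"] * 8 + ["paris"] * 9,
    "temperature": range(25),
})

n_groups = 4
# Shuffle the rows, then number each city's rows in shuffled order.
rank = toy.sample(frac=1, random_state=0).groupby("city").cumcount()
# A per-city offset staggers which group receives a city's extra rows.
offset = toy.groupby("city").ngroup()
# Both Series align on the original index before the modulo is taken.
toy["samplegroup"] = (rank + offset) % n_groups

for key, part in toy.groupby("samplegroup"):
    print(key, len(part), sorted(part["city"].unique()))
```

Which rows land in which group depends on the shuffle, but every group contains every city as long as each city has at least n_groups rows.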