I have a DataFrame like this:
import pandas as pd
df = pd.DataFrame({'grade': ['A','C','B'], 'year': [2018,2015,2017], 'label': [1,2,3]})
  grade  year  label
0     A  2018      1
1     C  2015      2
2     B  2017      3
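I want to expand the DataFrame based on the year column, which holds the most recent year for each label. Basically, each label should produce 4 additional rows, so that it covers its 5 most recent years in total.
Expected output: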
print(df_expanded)
   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3
What I tried:
parts = []
for lab in df['label'].unique():
    grp = df.loc[df['label'] == lab]
    yr = grp['year'].iloc[0]
    # the label's most recent year and the four years before it, newest first
    df_year = pd.DataFrame({'year': list(reversed(range(yr - 4, yr + 1)))})
    df_merged = pd.merge(grp, df_year, how='outer', on='year')
    # the row order of an outer merge differs across pandas versions,
    # so sort newest-first explicitly before forward-filling grade and label
    df_merged = df_merged.sort_values('year', ascending=False).ffill()
    parts.append(df_merged)
df_expanded = pd.concat(parts, axis=0).reset_index(drop=True)
df_expanded['label'] = df_expanded['label'].astype(int)
My for-loop approach works, but it runs very slowly on my real dataset (about 30,000 labels). I suspect there is a better way to do this. Thanks a lot!
You can try:
(pd.concat(df.assign(year=df['year'].sub(i)) for i in range(5))
   .sort_index(kind='stable')  # a stable sort keeps each label's years in descending order
   .reset_index(drop=True)
)
Output:
   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3
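If building five intermediate frames and sorting them back together feels wasteful on a large input, roughly the same result can be sketched with row repetition instead. This is a different technique from the concat approach, not a rewrite of it. It assumes NumPy is available, and `n_years` is only an illustrative name for the window size.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'C', 'B'],
                   'year': [2018, 2015, 2017],
                   'label': [1, 2, 3]})

n_years = 5  # illustrative name: number of rows to produce per label

# Repeat every row n_years times; each label's copies stay next to each other.
df_expanded = df.loc[df.index.repeat(n_years)].reset_index(drop=True)

# Offsets 0..n_years-1, tiled once per original row, are subtracted from the year.
df_expanded['year'] -= np.tile(np.arange(n_years), len(df))

print(df_expanded)
```

Because the rows are already grouped by label after the repeat, no sort is needed. This relies on the original index being unique, which holds for the default RangeIndex.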
pd.DataFrame(
    [
        (g, y, l) for g, Y, l in zip(*map(df.get, df))
        for y in range(Y, Y - 5, -1)
    ],
    columns=df.columns
)
   grade  year  label
0      A  2018      1
1      A  2017      1
2      A  2016      1
3      A  2015      1
4      A  2014      1
5      C  2015      2
6      C  2014      2
7      C  2013      2
8      C  2012      2
9      C  2011      2
10     B  2017      3
11     B  2016      3
12     B  2015      3
13     B  2014      3
14     B  2013      3
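The comprehension above unpacks exactly three columns in a fixed order, so it would need editing if the real data has more columns. One possible generalization is sketched below. The helper name `expand_years` and its parameters `n_years` and `year_col` are hypothetical, introduced only for illustration.

```python
import pandas as pd

def expand_years(df, n_years=5, year_col='year'):
    """Hypothetical helper: emit each row once per year in its n_years window."""
    rows = [
        {**rec, year_col: y}  # copy the row, overriding only the year
        for rec in df.to_dict('records')
        for y in range(rec[year_col], rec[year_col] - n_years, -1)
    ]
    return pd.DataFrame(rows, columns=df.columns)

df = pd.DataFrame({'grade': ['A', 'C', 'B'],
                   'year': [2018, 2015, 2017],
                   'label': [1, 2, 3]})

df_expanded = expand_years(df)
print(df_expanded)
```

Building a dict per output row is likely slower than the tuple version, so this trades some speed for not having to name every column.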
explode
df.assign(year=[range(y, y - 5, -1) for y in df.year]).explode('year')
  grade  year  label
0     A  2018      1
0     A  2017      1
0     A  2016      1
0     A  2015      1
0     A  2014      1
1     C  2015      2
1     C  2014      2
1     C  2013      2
1     C  2012      2
1     C  2011      2
2     B  2017      3
2     B  2016      3
2     B  2015      3
2     B  2014      3
2     B  2013      3
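Note that the explode result keeps the original index (0, 0, ..., 2) and leaves `year` as an object column. If the expected output from the question is needed exactly, a small follow-up along these lines should work. It is a sketch that assumes the years are plain integers with no missing values.

```python
import pandas as pd

df = pd.DataFrame({'grade': ['A', 'C', 'B'],
                   'year': [2018, 2015, 2017],
                   'label': [1, 2, 3]})

df_expanded = (
    df.assign(year=[range(y, y - 5, -1) for y in df['year']])
      .explode('year')
      .astype({'year': 'int64'})  # explode leaves 'year' with object dtype
      .reset_index(drop=True)     # replace the repeated index with 0..14
)

print(df_expanded.dtypes)
print(df_expanded)
```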