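When working with a pandas DataFrame in which one column holds timestamps, why do we usually convert that column to the index? What are the benefits?
I know how to set the index, but I am not sure what I gain from it.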
Suppose you have this DataFrame:
import pandas as pd
import numpy as np
dti = pd.date_range('2024-1-27', periods=24, freq='H')
temp = np.random.randint(10, 25, len(dti))
df = pd.DataFrame({'DateTime': dti, 'Temperature': temp})
print(df)
DateTime Temperature
0 2024-01-27 00:00:00 10
1 2024-01-27 01:00:00 21
2 2024-01-27 02:00:00 24
3 2024-01-27 03:00:00 18
4 2024-01-27 04:00:00 12
5 2024-01-27 05:00:00 16
6 2024-01-27 06:00:00 16
7 2024-01-27 07:00:00 15
8 2024-01-27 08:00:00 10
9 2024-01-27 09:00:00 20
10 2024-01-27 10:00:00 12
11 2024-01-27 11:00:00 18
12 2024-01-27 12:00:00 24
13 2024-01-27 13:00:00 17
14 2024-01-27 14:00:00 21
15 2024-01-27 15:00:00 15
16 2024-01-27 16:00:00 20
17 2024-01-27 17:00:00 16
18 2024-01-27 18:00:00 14
19 2024-01-27 19:00:00 12
20 2024-01-27 20:00:00 24
21 2024-01-27 21:00:00 24
22 2024-01-27 22:00:00 16
23 2024-01-27 23:00:00 16
If you want to select the rows between 13:00 and 17:00, you have to write:
>>> df[df['DateTime'].between('2024-01-27 13:00:00', '2024-01-27 17:00:00')]
DateTime Temperature
13 2024-01-27 13:00:00 17
14 2024-01-27 14:00:00 21
15 2024-01-27 15:00:00 15
16 2024-01-27 16:00:00 20
17 2024-01-27 17:00:00 16
Suppose you want to resample the values (3-hour mean):
>>> df.resample('3H', on='DateTime').mean().reset_index()
DateTime Temperature
0 2024-01-27 00:00:00 18.333333
1 2024-01-27 03:00:00 15.333333
2 2024-01-27 06:00:00 13.666667
3 2024-01-27 09:00:00 16.666667
4 2024-01-27 12:00:00 20.666667
5 2024-01-27 15:00:00 17.000000
6 2024-01-27 18:00:00 16.666667
7 2024-01-27 21:00:00 18.666667
Now let's set the DateTime column as the index:
>>> df = df.set_index('DateTime')
>>> df
Temperature
DateTime
2024-01-27 00:00:00 10
2024-01-27 01:00:00 21
2024-01-27 02:00:00 24
2024-01-27 03:00:00 18
2024-01-27 04:00:00 12
2024-01-27 05:00:00 16
2024-01-27 06:00:00 16
2024-01-27 07:00:00 15
2024-01-27 08:00:00 10
2024-01-27 09:00:00 20
2024-01-27 10:00:00 12
2024-01-27 11:00:00 18
2024-01-27 12:00:00 24
2024-01-27 13:00:00 17
2024-01-27 14:00:00 21
2024-01-27 15:00:00 15
2024-01-27 16:00:00 20
2024-01-27 17:00:00 16
2024-01-27 18:00:00 14
2024-01-27 19:00:00 12
2024-01-27 20:00:00 24
2024-01-27 21:00:00 24
2024-01-27 22:00:00 16
2024-01-27 23:00:00 16
And perform the same operations:
>>> df['2024-01-27 13:00:00':'2024-01-27 17:00:00']
Temperature
DateTime
2024-01-27 13:00:00 17
2024-01-27 14:00:00 21
2024-01-27 15:00:00 15
2024-01-27 16:00:00 20
2024-01-27 17:00:00 16
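The slice above is label-based, so both endpoints are included. Here is a minimal sketch of a few equivalent spellings that a DatetimeIndex makes available. The fixed seed, the `rng` generator and the variable names are assumptions added for illustration. The frequency is passed as a `pd.Timedelta` so that the snippet does not depend on the spelling of the hourly alias (`'H'` was deprecated in favor of `'h'` in pandas 2.2).

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one above, already indexed by time.
# The fixed seed is an assumption, added only to make the run repeatable.
rng = np.random.default_rng(0)
dti = pd.date_range('2024-01-27', periods=24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'Temperature': rng.integers(10, 25, len(dti))}, index=dti)
df.index.name = 'DateTime'

# Label-based slice with full timestamps: both endpoints are included.
full = df.loc['2024-01-27 13:00:00':'2024-01-27 17:00:00']

# Partial date strings also work, so the seconds can be dropped.
partial = df.loc['2024-01-27 13:00':'2024-01-27 17:00']

# Time-of-day filter; DataFrame.between_time requires a DatetimeIndex.
by_time = df.between_time('13:00', '17:00')

print(len(full), len(partial), len(by_time))
```

Note that `between_time` filters on the time of day only, so on data spanning several days it would return the 13:00 to 17:00 rows of every day, not just those of 2024-01-27.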
>>> df.resample('3H').mean()
Temperature
DateTime
2024-01-27 00:00:00 18.333333
2024-01-27 03:00:00 15.333333
2024-01-27 06:00:00 13.666667
2024-01-27 09:00:00 16.666667
2024-01-27 12:00:00 20.666667
2024-01-27 15:00:00 17.000000
2024-01-27 18:00:00 16.666667
2024-01-27 21:00:00 18.666667
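Both operations now work directly on the index, with no boolean mask and no on= argument, so this is clearly the more concise way to write it.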
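To check that the two styles really are interchangeable, the column-based and index-based versions can be run side by side on the same data. This is only a sketch: the seed and the names `by_column`, `by_index`, `sel_*` and `mean_*` are hypothetical, and `pd.Timedelta` is used in place of the `'H'`/`'3H'` strings so the snippet does not depend on how the hourly alias is spelled in a given pandas version.

```python
import numpy as np
import pandas as pd

# Same data in two layouts: time as a regular column, and time as the index.
# The fixed seed is an assumption, added only to make the run repeatable.
rng = np.random.default_rng(0)
dti = pd.date_range('2024-01-27', periods=24, freq=pd.Timedelta(hours=1))
by_column = pd.DataFrame({'DateTime': dti,
                          'Temperature': rng.integers(10, 25, len(dti))})
by_index = by_column.set_index('DateTime')

start, end = '2024-01-27 13:00:00', '2024-01-27 17:00:00'
three_hours = pd.Timedelta(hours=3)

# Selection: boolean mask on the column versus a label slice on the index.
sel_column = by_column[by_column['DateTime'].between(start, end)].set_index('DateTime')
sel_index = by_index.loc[start:end]

# Resampling: on= names the time column; with a DatetimeIndex it is not needed.
mean_column = by_column.resample(three_hours, on='DateTime').mean()
mean_index = by_index.resample(three_hours).mean()

# Both comparisons are expected to report that the results match.
print(sel_column.equals(sel_index))
print(mean_column.equals(mean_index))
```

The results are the same either way; the index-based form simply saves you from repeating the column name in every call.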