How do I iterate over a list of times, find the skipped/missing times, and fill them in?

Question (votes: 1, answers: 2)

I have a list of files recorded at 12-minute intervals for each day. I should have 120 files, but I only have 114, which means six 12-minute time slots are missing. I built an array with the datetime on the left and the data on the right:

Row | Time                | Rain_Rate
1   | 2014/04/02 00:00:09 | 0.5
2   | 2014/04/02 00:12:20 | 1.1
3   | 2014/04/02 00:24:05 | 0.48
4   | 2014/04/02 00:36:51 | 2.3
5   | 2014/04/02 01:00:08 | 4.1
6   | 2014/04/02 01:12:26 | 5.0
7   | 2014/04/02 01:24:02 | 3.2

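You can see that a time is missing between rows 4 and 5 (it should be around 00:48:00). How can I fill in this row using the average of the Rain_Rate values on either side?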

Also, what should I do if two consecutive times/rows are missing?

python pandas loops interpolation missing-data
2 Answers
0 votes

First, build a DatetimeIndex with a 12-minute frequency:

import datetime
import pandas as pd
import numpy as np

start = datetime.datetime(2014, 4, 2)
end = datetime.datetime(2014, 8, 1)  # whenever your time series ends
idx = pd.date_range(start, end, freq='12min')  # 12-minute steps ('12T' is deprecated in recent pandas)

Next, build a DataFrame on the new index:

df = pd.DataFrame(np.nan, index=idx, columns=['dummy']) # you need to provide a column name

I assume your data is in a pd.Series s with a DatetimeIndex; otherwise you have to build one.

Now you can use the DatetimeIndex and the full power of pandas:

df['Rain_Rate'] = s
df['Rain_Rate'] = df['Rain_Rate'].interpolate()  # standard is linear approximation

See interpolate() for more options.
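One caveat with the assignment above: `df['Rain_Rate'] = s` aligns on exact index labels, so a timestamp such as 00:00:09 does not match the grid point 00:00:00 and the column would come out as NaN. Below is a minimal sketch of one way around this, assuming it is acceptable to round each timestamp to the nearest 12-minute mark. The sample values are taken from the question; the names `grid` and `filled` are illustrative.

```python
import pandas as pd

# Sample data from the question; the slot around 00:48 is missing.
s = pd.Series(
    [0.5, 1.1, 0.48, 2.3, 4.1, 5.0, 3.2],
    index=pd.to_datetime([
        "2014/04/02 00:00:09", "2014/04/02 00:12:20", "2014/04/02 00:24:05",
        "2014/04/02 00:36:51", "2014/04/02 01:00:08", "2014/04/02 01:12:26",
        "2014/04/02 01:24:02",
    ]),
    name="Rain_Rate",
)

# Assumption: each file belongs to the nearest 12-minute slot.
s.index = s.index.round("12min")

# Regular 12-minute grid spanning the data.
grid = pd.date_range(s.index.min(), s.index.max(), freq="12min")

# reindex puts NaN in the missing slots; interpolate fills them linearly.
filled = s.reindex(grid).interpolate()
print(filled)
```

With a single missing slot, linear interpolation gives the average of the two neighbours, which is what the question asks for. With two or more consecutive missing slots, the values are spread evenly between the neighbours. If two files round to the same slot, the duplicates would have to be resolved (for example by averaging) before calling reindex.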


0 votes

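As an alternative, you can keep the time series as it is and add rows only at the missing positions, whenever the time difference between consecutive rows is greater than 12 minutes. Because the new timestamps are offset from the previous existing timestamp, you will not get perfect 12-minute slices between the last generated row and the next existing value.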

import pandas as pd

df = pd.DataFrame([
["2014/04/02 00:00:09",0.5],
["2014/04/02 00:12:20",1.1],
["2014/04/02 00:24:05",0.48],
["2014/04/02 00:36:51",2.3],
["2014/04/02 01:00:08",4.1],
["2014/04/02 01:12:26",5.0],
["2014/04/02 01:24:02",3.2],
["2014/04/02 02:44:02",1.2], # added for test
["2014/04/02 03:54:02",7.72] # added for test
])

df[0] = pd.to_datetime(df[0])
print(df)

delta = df[0].diff()
diff_idx = delta.where(delta > pd.Timedelta("00:12:59"))
print(delta)
idx = df[diff_idx.notnull()].index

td = pd.Timedelta("00:12:00")

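new_rows = []  # collect the rows to insert; DataFrame.append was removed in pandas 2.0

for k in idx:
    prev_time = df.loc[k - 1, 0]
    deltaT = df.loc[k, 0] - prev_time
    num_missrows = deltaT // td
    if deltaT % td != pd.Timedelta(0):
        num_missrows += 1  # gap is not an exact multiple of 12 min, so one more slot fits
    new_avg = df.loc[k - 1, 1]  # previous existing Rain value

    # range stops before the next existing time value, so nothing overlaps it
    for i in range(1, num_missrows):
        avg = (df.loc[k, 1] + new_avg) / 2  # midpoint of last filled value and next existing value
        new_rows.append([prev_time + i * td, avg])
        new_avg = avg

df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
df = df.sort_values(by=0).reset_index(drop=True)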
print(df)

Output of df:

                     0        1
0  2014-04-02 00:00:09  0.50000
1  2014-04-02 00:12:20  1.10000
2  2014-04-02 00:24:05  0.48000
3  2014-04-02 00:36:51  2.30000
4  2014-04-02 00:48:51  3.20000  # added row
5  2014-04-02 01:00:08  4.10000
6  2014-04-02 01:12:26  5.00000
7  2014-04-02 01:24:02  3.20000
8  2014-04-02 01:36:02  2.20000  # added row
9  2014-04-02 01:48:02  1.70000  # added row
10 2014-04-02 02:00:02  1.45000  # added row
11 2014-04-02 02:12:02  1.32500  # added row
12 2014-04-02 02:24:02  1.26250  # added row
13 2014-04-02 02:36:02  1.23125  # added row, not 12 min (~8 min diff.)
14 2014-04-02 02:44:02  1.20000
15 2014-04-02 02:56:02  4.46000  # added row
16 2014-04-02 03:08:02  6.09000  # added row
17 2014-04-02 03:20:02  6.90500  # added row
18 2014-04-02 03:32:02  7.31250  # added row
19 2014-04-02 03:44:02  7.51625  # added row, not 12 min (~10 min diff.)
20 2014-04-02 03:54:02  7.72000
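Note that the loop fills each new row with the midpoint between the last filled value and the next existing value, so in a long gap the fills approach the next value quickly (2.2, 1.7, 1.45, ...) instead of being spread evenly. If evenly spread values are preferred, one option is to insert the same timestamps and let pandas interpolate by time. The following is a rough sketch under that assumption, using a subset of the sample rows; the names `step`, `tolerance`, `extra` and `filled` are illustrative and not part of the answer.

```python
import pandas as pd

s = pd.Series(
    [2.3, 4.1, 5.0, 3.2, 1.2],
    index=pd.to_datetime([
        "2014/04/02 00:36:51", "2014/04/02 01:00:08", "2014/04/02 01:12:26",
        "2014/04/02 01:24:02", "2014/04/02 02:44:02",
    ]),
)

step = pd.Timedelta(minutes=12)
tolerance = pd.Timedelta("00:12:59")  # same gap threshold as the loop above

# Timestamps to insert: previous time + 12 min, + 24 min, ... before the next row.
extra = []
for prev, cur in zip(s.index[:-1], s.index[1:]):
    if cur - prev > tolerance:
        t = prev + step
        while t < cur:
            extra.append(t)
            t += step

# Add the new timestamps as NaN, then interpolate weighted by elapsed time.
filled = s.reindex(s.index.union(pd.DatetimeIndex(extra))).interpolate(method="time")
print(filled)
```

Here method="time" weights each fill by its distance in time from the neighbouring existing rows, so the uneven last slice before the next existing value is accounted for.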
