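How to create a time series from a dataframe of event durations?

Question · Votes: 0 · Answers: 3

I have a dataframe containing the bookings for one room (rows: booking_id, check-in date and check-out date) that I want to convert into a time series indexed over the whole year (index: days of the year, feature: booked or not booked).

I computed the duration of each booking and reindexed the dataframe by day. Now I need to forward-fill the dataframe, but only a limited number of times: the duration of each booking.

I tried iterating over each row with ffill, but it applies to the whole dataframe rather than to the selected rows. Any idea how I can do this?

Here is my code: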

import numpy as np
import pandas as pd
#create dataframe
data=[[1, '2019-01-01', '2019-01-02', 1],
      [2, '2019-01-03', '2019-01-07', 4], 
      [3, '2019-01-10','2019-01-13', 3]]
df = pd.DataFrame(data, columns=['booking_id', 'check-in', 'check-out', 'duration'])

#cast dates to datetime formats
df['check-in'] = pd.to_datetime(df['check-in'])
df['check-out'] = pd.to_datetime(df['check-out'])

#create timeseries indexed on check-in date
df2 = df.set_index('check-in')

#create new index and reindex timeseries
idx = pd.date_range(min(df['check-in']), max(df['check-out']), freq='D')
ts = df2.reindex(idx)

This is what I get:

            booking_id  check-out  duration
2019-01-01         1.0 2019-01-02       1.0
2019-01-02         NaN        NaT       NaN
2019-01-03         2.0 2019-01-07       4.0
2019-01-04         NaN        NaT       NaN
2019-01-05         NaN        NaT       NaN
2019-01-06         NaN        NaT       NaN
2019-01-07         NaN        NaT       NaN
2019-01-08         NaN        NaT       NaN
2019-01-09         NaN        NaT       NaN
2019-01-10         3.0 2019-01-13       3.0
2019-01-11         NaN        NaT       NaN
2019-01-12         NaN        NaT       NaN
2019-01-13         NaN        NaT       NaN

This is what I would like to have:

            booking_id  check-out  duration
2019-01-01         1.0 2019-01-02       1.0
2019-01-02         1.0 2019-01-02       1.0
2019-01-03         2.0 2019-01-07       4.0
2019-01-04         2.0 2019-01-07       4.0
2019-01-05         2.0 2019-01-07       4.0
2019-01-06         2.0 2019-01-07       4.0
2019-01-07         NaN        NaT       NaN
2019-01-08         NaN        NaT       NaN
2019-01-09         NaN        NaT       NaN
2019-01-10         3.0 2019-01-13       3.0
2019-01-11         3.0 2019-01-13       3.0
2019-01-12         3.0 2019-01-13       3.0
2019-01-13         NaN        NaT       NaN
python pandas time-series
3 Answers
1
vote
filluntil = ts['check-out'].ffill()
m = ts.index < filluntil.values

#reshaping the mask to be the same shape as ts
m = np.repeat(m, ts.shape[1]).reshape(ts.shape)

ts = ts.ffill().where(m)

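First we create a series of forward-filled check-out dates. Then we create a mask that is True wherever the index date is earlier than the filled check-out date. Finally we forward-fill the dataframe and keep only the values selected by the mask.

If you want to include the rows that fall on the check-out date, change the comparison in m from < to <=.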
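As a rough, self-contained sketch of the same masking idea, the snippet below rebuilds the question's data and derives the "booked or not booked" feature directly from the mask. The names fill_until, mask and booked are illustrative and not part of the original answer, and the check-out day is assumed to be free (strict <).

```python
import pandas as pd

# Bookings from the question; the check-out day is assumed to be free again.
df = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "check-in": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-10"]),
    "check-out": pd.to_datetime(["2019-01-02", "2019-01-07", "2019-01-13"]),
})

# One row per day between the first check-in and the last check-out.
idx = pd.date_range(df["check-in"].min(), df["check-out"].max(), freq="D")
ts = df.set_index("check-in").reindex(idx)

# Carry each check-out date forward, then flag a day as booked only
# while it is strictly earlier than that carried-forward date.
fill_until = ts["check-out"].ffill()
mask = ts.index < fill_until.to_numpy()

# Boolean daily series: True = booked, False = not booked.
booked = pd.Series(mask, index=ts.index, name="booked")
print(booked)
```

With this data the series marks 2019-01-01, 2019-01-03 to 2019-01-06 and 2019-01-10 to 2019-01-12 as booked.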

1
vote

I think that to "forward-fill the dataframe" you should use the pandas interpolate method. The documentation can be found here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html

You can do something like this:

int_how_many_consecutive_to_fill = 3
df2 = df2.interpolate(axis=0, limit=int_how_many_consecutive_to_fill, limit_direction='forward')

See the interpolate documentation for the details; the method accepts many flags that customize its behavior.
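To illustrate what a limit does, here is a minimal sketch on a toy numeric series. It uses ffill(limit=...) rather than interpolate, which is a swapped-in but closely related call, and the series s is made up for the example. The limit is a single number for the whole call, which is why a per-booking limit needs extra work.

```python
import numpy as np
import pandas as pd

# Toy series: one value followed by five missing days (illustrative data).
s = pd.Series(
    [4.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    index=pd.date_range("2019-01-03", periods=6, freq="D"),
)

# Fill at most 3 consecutive missing values after the last valid one;
# the remaining two days stay NaN.
filled = s.ffill(limit=3)
print(filled)
```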

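Edit:

Doing this with the value from the duration column as the limit for each fill is a bit messy, but I think it should work (there may be a less hacky, cleaner solution using some feature of pandas or of another library that I am not aware of):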

#work on the daily reindexed frame (ts in the question)
df2 = ts.copy()
#True for rows where every column is NaN (days without a check-in)
nan_rows = df2.isnull().all(axis=1)
#get rows with nans in them:
nans_df = df2[nan_rows]
#get rows without nans in them:
non_nans_df = df2[~nan_rows]

#list of dfs we will concat vertically at the end to get final dataframe.
dfs = []

#iterate through each row that contains NaNs.
for nan_index, nan_row in nans_df.iterrows():
    previous_day = nan_index - pd.Timedelta(days=1)
    #only continue if the day before this NaN row has non nan values; if the previous day is a nan day, skip this iteration (that run was already handled from its first row). This also handles the case where the first row is a NaN one.
    if previous_day not in non_nans_df.index:
        continue

    date_offset = 0
    #count how many consecutive all-NaN rows there are starting from this one; the count is stored in the date_offset variable.
    while (nan_index + pd.Timedelta(days=date_offset)) in nans_df.index:
        date_offset += 1

    #last date in the run of consecutive all-NaN days that starts at nan_index.
    end_sequence_date = nan_index + pd.Timedelta(days=date_offset - 1)

    #dataframe whose first row is the previous day (non NaN, as checked above), followed by the consecutive NaN rows after it. .loc slices by date label and includes both ends.
    df_to_interpolate = pd.concat([non_nans_df.loc[[previous_day]], nans_df.loc[nan_index:end_sequence_date]])

    #pull the duration value from the first row of df_to_interpolate.
    limit_val = int(df_to_interpolate['duration'].iloc[0])

    #forward-fill at most limit_val rows; ffill replaces interpolate here because the frame also holds a datetime column (check-out) and only a repeat of the previous row is needed.
    df_to_interpolate = df_to_interpolate.ffill(limit=limit_val)

    #append df_to_interpolate to our list that gets combined at the end.
    dfs.append(df_to_interpolate)

#gives us our final dataframe, filled forward using a dynamic limit value based on the most recent duration value.
final_df = pd.concat(dfs)
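The same per-booking limit can probably be written more compactly by labelling each run of days with a cumulative counter and forward-filling every run separately. The sketch below is one possible variant, not the answer's own code; run_id and pieces are illustrative names. Like the loop above it uses duration as the limit, so the check-out day itself is filled as well.

```python
import pandas as pd

# Data from the question, including the precomputed duration column.
df = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "check-in": pd.to_datetime(["2019-01-01", "2019-01-03", "2019-01-10"]),
    "check-out": pd.to_datetime(["2019-01-02", "2019-01-07", "2019-01-13"]),
    "duration": [1, 4, 3],
})
idx = pd.date_range(df["check-in"].min(), df["check-out"].max(), freq="D")
ts = df.set_index("check-in").reindex(idx)

# The counter increases at every row that starts a booking, so each booking
# row and the empty days that follow it share one label.
run_id = ts["booking_id"].notna().cumsum()

pieces = []
for _, run in ts.groupby(run_id):
    # Limit for this run: the duration stored on its first (booking) row.
    limit_val = int(run["duration"].iloc[0])
    pieces.append(run.ffill(limit=limit_val))

final_df = pd.concat(pieces)
print(final_df)
```

Note that ffill requires a limit greater than 0, so using duration - 1 to leave the check-out day empty would need a special case for one-night bookings.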

0
votes
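df1 = ts  #the daily reindexed frame from the question
def function1(dd: pd.DataFrame):
    #duration of the booking that starts this group (third column)
    num2=int(dd.iat[0,2])
    #forward-fill only the first num2 rows of the group; the rest stay NaN
    return dd.combine_first(dd.iloc[:num2,:].ffill())

#col1 numbers the groups: it increases by one at every row that starts a booking
df1.assign(col1=df1.duration.gt(0).cumsum()).groupby(['col1'], as_index=False, group_keys=False).apply(function1)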

         booking_id   check-out  duration  col1
2019-01-01         1.0  2019-01-02       1.0     1
2019-01-02         NaN         NaN       NaN     1
2019-01-03         2.0  2019-01-07       4.0     2
2019-01-04         2.0  2019-01-07       4.0     2
2019-01-05         2.0  2019-01-07       4.0     2
2019-01-06         2.0  2019-01-07       4.0     2
2019-01-07         NaN         NaN       NaN     2
2019-01-08         NaN         NaN       NaN     2
2019-01-09         NaN         NaN       NaN     2
2019-01-10         3.0  2019-01-13       3.0     3
2019-01-11         3.0  2019-01-13       3.0     3
2019-01-12         3.0  2019-01-13       3.0     3
2019-01-13         NaN         NaN       NaN     3