How do I create rows in 10-minute increments between two timestamps so that one row's end time equals the next row's start time?


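I have a dataset with four columns, two of which are timestamps: "Start_time" and "End_time". The "Time_diff" column is the difference between the start and end times, in minutes. The "Bus" column is the number of buses recorded within that time range.

Every time difference should be less than or equal to 10. However, I have entries where the time difference is greater than 10, as shown below (row 3).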

Start_time              End_time             Time_diff  Bus
2023-12-13 11:20:00.00  2023-12-13 11:30:00         10   10
2023-12-13 11:36:00.00  2023-12-13 11:40:00          4    3
2023-12-13 11:40:00.00  2023-12-13 12:10:00         30   23
2023-12-13 13:00:00.00  2023-12-13 13:10:00         10    7
2023-12-13 13:10:00.00  2023-12-13 13:20:00         10    9

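I want to create additional rows in 10-minute increments that break up any pair of timestamps whose difference is greater than 10, so that the end time of one row equals the start time of the next.

Here is the function I wrote: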

import pandas as pd
import numpy as np

def generate_rows(df, start_col, end_col):
    # Set the time difference increment
    time_difference = pd.Timedelta(minutes=10)

    # Create additional rows
    new_rows = []
    for _, row in df.iterrows():
        current_time = row[start_col]
        end_time = row[end_col]
        while current_time < end_time:
            new_end_time = current_time + time_difference
            new_row = {col: np.nan for col in df.columns}
            new_row[start_col] = current_time
            new_row[end_col] = new_end_time
            new_rows.append(new_row)
            current_time = new_end_time

    # Convert the list of dictionaries to a DataFrame
    new_rows_df = pd.DataFrame(new_rows)

    return pd.concat([df, new_rows_df]).sort_values(by=start_col).reset_index(drop=True)

This is the output I expect:

Start_time              End_time             Time_diff  Bus
2023-12-13 11:20:00.00  2023-12-13 11:30:00         10   10
2023-12-13 11:30:00.00  2023-12-13 11:36:00          6  NaN
2023-12-13 11:36:00.00  2023-12-13 11:40:00          4    3
2023-12-13 11:40:00.00  2023-12-13 11:50:00         10  NaN
2023-12-13 11:50:00.00  2023-12-13 12:00:00         10  NaN
2023-12-13 12:00:00.00  2023-12-13 12:10:00         10   23
2023-12-13 13:00:00.00  2023-12-13 13:10:00         10    7
2023-12-13 13:10:00.00  2023-12-13 13:20:00         10    9

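Before running this function, I found 19 rows in my dataset with a time difference greater than 10. After running the code, there are still 19 rows with a time difference greater than 10. Please help!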

python pandas time-series data-science data-analysis
2 Answers
0 votes

Another option. No need to create extra columns.

import pandas as pd

def split_time_intervals(df, start_col, end_col, max_diff=10):
    new_rows = []
    
    for _, row in df.iterrows():
        start_time = row[start_col]
        end_time = row[end_col]
        bus_count = row['Bus']

        while (end_time - start_time).total_seconds() / 60 > max_diff:
            new_end_time = start_time + pd.Timedelta(minutes=max_diff)
            new_rows.append({start_col: start_time, end_col: new_end_time, 'Bus': None})
            start_time = new_end_time

        new_rows.append({start_col: start_time, end_col: end_time, 'Bus': bus_count})

    return pd.DataFrame(new_rows)

# Sample dataset
data = {
    'Start_time': pd.to_datetime(['2023-12-13 11:20', '2023-12-13 11:36', '2023-12-13 11:40', '2023-12-13 13:00', '2023-12-13 13:10']),
    'End_time': pd.to_datetime(['2023-12-13 11:30', '2023-12-13 11:40', '2023-12-13 12:10', '2023-12-13 13:10', '2023-12-13 13:20']),
    'Bus': [10, 3, 23, 7, 9]
}
df = pd.DataFrame(data)

# Apply the function
new_df = split_time_intervals(df, 'Start_time', 'End_time')
print(new_df)

Output:

           Start_time            End_time   Bus
0 2023-12-13 11:20:00 2023-12-13 11:30:00  10.0
1 2023-12-13 11:36:00 2023-12-13 11:40:00   3.0
2 2023-12-13 11:40:00 2023-12-13 11:50:00   NaN
3 2023-12-13 11:50:00 2023-12-13 12:00:00   NaN
4 2023-12-13 12:00:00 2023-12-13 12:10:00  23.0
5 2023-12-13 13:00:00 2023-12-13 13:10:00   7.0
6 2023-12-13 13:10:00 2023-12-13 13:20:00   9.0
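
One way to confirm that a split worked is to recompute the interval lengths on the result. Note that the function in the question concatenates its new rows onto the original df without replacing the long rows, which would explain why the count of 19 did not change; the function above builds a fresh frame instead. The sketch below is only an illustration: interval_minutes is a hypothetical helper name, and the small frame copies three rows from the output above.

```python
import pandas as pd

def interval_minutes(df, start_col, end_col):
    # Hypothetical helper: length of each interval in minutes
    return (df[end_col] - df[start_col]).dt.total_seconds() / 60

# Three rows copied from the split result shown above
result = pd.DataFrame({
    'Start_time': pd.to_datetime(['2023-12-13 11:40', '2023-12-13 11:50', '2023-12-13 12:00']),
    'End_time': pd.to_datetime(['2023-12-13 11:50', '2023-12-13 12:00', '2023-12-13 12:10']),
    'Bus': [None, None, 23],
})

minutes = interval_minutes(result, 'Start_time', 'End_time')
too_long = int((minutes > 10).sum())  # number of intervals still longer than 10 minutes
print(too_long)
```

If too_long is not 0 on the real data, some intervals were not split.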

0 votes

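I wouldn't use a loop; I would use vectorized operations instead.

You can repeat the rows as many times as needed, split the values into chunks of at most 10, and then recompute the differences, starts, and ends: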

import numpy as np
import pandas as pd

MAX = 10

# ensure datetime64
df[['Start_time', 'End_time']] = df[['Start_time', 'End_time']].apply(pd.to_datetime)

# identify rows to duplicate
d, r = df['Time_diff'].divmod(MAX)
out = df.loc[df.index.repeat(d+r.ne(0)*2)]

# correct Time_diff
g = out.groupby(level=0)

out['Time_diff'] = (g['Time_diff'].transform('last')
                    .sub(g.cumcount(ascending=False)*MAX)
                    .clip(upper=MAX).abs()
                   )

# Only keep last "Bus" per original row
m = out.index.duplicated(keep='last')
out.loc[m, 'Bus'] = np.nan

# Decrement End_time from original end
end = g['End_time'].transform('last')
out['End_time'] = end.sub(pd.to_timedelta(out
                          .loc[::-1, 'Time_diff']
                          .groupby(level=0)
                          .transform(lambda x: x.cumsum().shift())
                          [::-1], unit='min')
                         ).fillna(end)
# Update Start_time
out['Start_time'] = out['End_time'].sub(pd.to_timedelta(out['Time_diff'], unit='min'))

Output:

          Start_time            End_time  Time_diff   Bus
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10  10.0
1 2023-12-13 11:30:00 2023-12-13 11:36:00          6   NaN
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4   3.0
2 2023-12-13 11:40:00 2023-12-13 11:50:00         10   NaN
2 2023-12-13 11:50:00 2023-12-13 12:00:00         10   NaN
2 2023-12-13 12:00:00 2023-12-13 12:10:00         10  23.0
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10   7.0
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10   9.0

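How it works

First, identify the rows to duplicate

To do this, we use divmod to get the quotient and the remainder. The number of rows then follows the formula number of rows = quotient + (2 if the remainder is not zero), for example:
30 -> 3 + 0
4 -> 0 + 2
29 -> 2 + 2
When the remainder is not zero, one of the two extra rows holds the remainder and the other pads the interval up to a multiple of 10, which produces the gap-filling row in the expected output. We use these counts to duplicate the rows with index.repeat.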
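
The three examples in the formula can be checked directly. This is a minimal sketch on a standalone Series holding only those three values; the name repeats is introduced here for illustration.

```python
import pandas as pd

MAX = 10
time_diff = pd.Series([30, 4, 29])

# quotient and remainder of each value divided by MAX
d, r = time_diff.divmod(MAX)

# two extra rows whenever the remainder is not zero
repeats = d + r.ne(0) * 2
print(repeats.tolist())  # [3, 2, 4]
```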

Now we have duplicated rows that carry exactly the same information:

d, r = df['Time_diff'].divmod(MAX)
out = df.loc[df.index.repeat(d+r.ne(0)*2)]

           Start_time            End_time  Time_diff  Bus
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10   10
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4    3
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4    3
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10    7
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10    9
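
Correct Time_diff

To do this, we use groupby.cumcount to enumerate the rows of each group (counting from the end), subtract 10 minutes per position from the original value, clip the result to at most 10 minutes, and take the absolute value: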

           Start_time            End_time  Time_diff  Bus  cumcount  cumcount*MAX  clip(upper=MAX)  abs  Time_diff_corr
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10   10         0             0               10   10              10
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4    3         1            10               -6    6               6
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4    3         0             0                4    4               4
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23         2            20               10   10              10
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23         1            10               10   10              10
2 2023-12-13 11:40:00 2023-12-13 12:10:00         30   23         0             0               10   10              10
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10    7         0             0               10   10              10
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10    9         0             0               10   10              10
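
The intermediate columns of this step can be reproduced on a reduced example. The sketch below assumes only the repeated Time_diff values for index 1 (4 minutes) and index 2 (30 minutes); the names from_end, shifted and corrected are introduced here for illustration.

```python
import pandas as pd

MAX = 10
# Time_diff after repeating rows: index 1 twice, index 2 three times
time_diff = pd.Series([4, 4, 30, 30, 30], index=[1, 1, 2, 2, 2])
g = time_diff.groupby(level=0)

# position of each row counted from the end of its group
from_end = g.cumcount(ascending=False)

# original value minus MAX per position, then capped at MAX, then made positive
shifted = g.transform('last').sub(from_end * MAX)
corrected = shifted.clip(upper=MAX).abs()

print(shifted.tolist())    # [-6, 4, 10, 20, 30]
print(corrected.tolist())  # [6, 4, 10, 10, 10]
```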

Mask the Bus values

We use Index.duplicated to identify the last row for each index value and mask the others:

m = out.index.duplicated(keep='last')
out.loc[m, 'Bus'] = np.nan

           Start_time            End_time  Time_diff   Bus      m
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10  10.0  False
1 2023-12-13 11:36:00 2023-12-13 11:40:00          6   NaN   True
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4   3.0  False
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10   NaN   True
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10   NaN   True
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10  23.0  False
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10   7.0  False
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10   9.0  False
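
A small standalone illustration of the mask, assuming only the repeated Bus values for index 1 and index 2 (stored as floats so that NaN can be assigned):

```python
import numpy as np
import pandas as pd

bus = pd.Series([3, 3, 23, 23, 23], index=[1, 1, 2, 2, 2], dtype='float64')

# True for every row except the last occurrence of each index value
m = bus.index.duplicated(keep='last')
bus.loc[m] = np.nan

print(m.tolist())    # [True, False, True, True, False]
print(bus.tolist())  # [nan, 3.0, nan, nan, 23.0]
```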
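
Correct the end

We compute the reversed, shifted cumsum of Time_diff per group, convert it to a timedelta, and subtract it from the last End_time of each group (obtained with groupby.transform('last')):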

           Start_time            End_time  Time_diff   Bus  reverse_cumsum_shift    as_timedelta        sub_from_end       End_corrected
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10  10.0                   NaN             NaT                 NaT 2023-12-13 11:30:00
1 2023-12-13 11:36:00 2023-12-13 11:40:00          6   NaN                   4.0 0 days 00:04:00 2023-12-13 11:36:00 2023-12-13 11:36:00
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4   3.0                   NaN             NaT                 NaT 2023-12-13 11:40:00
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10   NaN                  20.0 0 days 00:20:00 2023-12-13 11:50:00 2023-12-13 11:50:00
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10   NaN                  10.0 0 days 00:10:00 2023-12-13 12:00:00 2023-12-13 12:00:00
2 2023-12-13 11:40:00 2023-12-13 12:10:00         10  23.0                   NaN             NaT                 NaT 2023-12-13 12:10:00
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10   7.0                   NaN             NaT                 NaT 2023-12-13 13:10:00
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10   9.0                   NaN             NaT                 NaT 2023-12-13 13:20:00
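
The reversed, shifted cumsum is the least obvious part, so here is a reduced sketch of it. It assumes the corrected Time_diff values for index 1 and index 2 only, and it fills the missing offsets with 0 instead of calling fillna(end) afterwards, which should give the same result; the names after and end_corrected are introduced here for illustration.

```python
import pandas as pd

time_diff = pd.Series([6, 4, 10, 10, 10], index=[1, 1, 2, 2, 2])
end = pd.Series(
    pd.to_datetime(['2023-12-13 11:40'] * 2 + ['2023-12-13 12:10'] * 3),
    index=time_diff.index,
)

# minutes covered by the rows that come after each row within its group
after = (time_diff[::-1]
         .groupby(level=0)
         .transform(lambda x: x.cumsum().shift())
         [::-1])

# the last row of each group has no later rows, so its offset is 0
end_corrected = end - pd.to_timedelta(after.fillna(0), unit='min')
print(end_corrected.dt.strftime('%H:%M').tolist())
# ['11:36', '11:40', '11:50', '12:00', '12:10']
```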

Correct the start

Now that the ends are correct, we subtract Time_diff (in minutes) from each end to get the start:

out['Start_time'] = out['End_time'].sub(pd.to_timedelta(out['Time_diff'], unit='min'))

           Start_time            End_time  Time_diff   Bus
0 2023-12-13 11:20:00 2023-12-13 11:30:00         10  10.0
1 2023-12-13 11:30:00 2023-12-13 11:36:00          6   NaN
1 2023-12-13 11:36:00 2023-12-13 11:40:00          4   3.0
2 2023-12-13 11:40:00 2023-12-13 11:50:00         10   NaN
2 2023-12-13 11:50:00 2023-12-13 12:00:00         10   NaN
2 2023-12-13 12:00:00 2023-12-13 12:10:00         10  23.0
3 2023-12-13 13:00:00 2023-12-13 13:10:00         10   7.0
4 2023-12-13 13:10:00 2023-12-13 13:20:00         10   9.0
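
As a quick sanity check of this last step, the sketch below recomputes the starts for the three rows of index 2 and verifies that each start equals the previous end. The variable contiguous is a name introduced here for illustration.

```python
import pandas as pd

end = pd.Series(pd.to_datetime(['2023-12-13 11:50', '2023-12-13 12:00', '2023-12-13 12:10']))
time_diff = pd.Series([10, 10, 10])

# start = end minus the interval length in minutes
start = end - pd.to_timedelta(time_diff, unit='min')

# each start (from the second row on) should equal the previous row's end
contiguous = bool((start.iloc[1:].to_numpy() == end.iloc[:-1].to_numpy()).all())
print(start.dt.strftime('%H:%M').tolist())  # ['11:40', '11:50', '12:00']
print(contiguous)                           # True
```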