Fast function for creating rows between rows of a DataFrame when a specific condition is met | Python pandas


I'm looking for a faster alternative for editing the rows of a pandas DataFrame (splitting rows, adding rows).

Here is the code:

import datetime as dt

import pandas as pd

# Definition of the start
start_0 = dt.datetime(2023, 4, 1, 0, 0, 0)

# Create a DataFrame with columns Start, End and Info
# Calculate the duration between End and Start
data = (pd.DataFrame({"Start": [(start_0 + dt.timedelta(hours=i)) for i in range(0,24*10_000)],
                      "End": [(start_0 + dt.timedelta(hours=i+1)) for i in range(0,24*10_000)],
                      "Info": [f"column{i}" for i in range(0,24*10_000)]
                     })
        .assign(Duration = lambda df_:
                ((df_.End - df_.Start).dt.total_seconds())/60
               )
       )

data

                     Start                 End          Info  Duration
0      2023-04-01 00:00:00 2023-04-01 01:00:00       column0      60.0
1      2023-04-01 01:00:00 2023-04-01 02:00:00       column1      60.0
2      2023-04-01 02:00:00 2023-04-01 03:00:00       column2      60.0
3      2023-04-01 03:00:00 2023-04-01 04:00:00       column3      60.0
4      2023-04-01 04:00:00 2023-04-01 05:00:00       column4      60.0
...                    ...                 ...           ...       ...
239995 2050-08-16 19:00:00 2050-08-16 20:00:00  column239995      60.0
239996 2050-08-16 20:00:00 2050-08-16 21:00:00  column239996      60.0
239997 2050-08-16 21:00:00 2050-08-16 22:00:00  column239997      60.0
239998 2050-08-16 22:00:00 2050-08-16 23:00:00  column239998      60.0
239999 2050-08-16 23:00:00 2050-08-17 00:00:00  column239999      60.0

...and here is the "slow" function:

def split_data(data):
    """
    The function splits the rows of the data until the duration 15 minutes is reached
    
    Parameters:
        data (DataFrame): Data with Start, End and Info
        
    Returns:
        data (DataFrame): Data with max. duration of 15 min
    
    """
    
    # While-Statement
    while data.Duration.mean() != 15:
        
        # Create an empty list
        tuples = []
        
        # Loop through
        for row in data.itertuples():
            
            # append rows with half of the duration
            if row.Duration != 15:
                tuples.append((row.Start,
                               row.Start + dt.timedelta(minutes=row.Duration/2),
                               row.Info,
                               row.Duration/2))

                tuples.append((row.Start + dt.timedelta(minutes=row.Duration/2),
                               row.End,
                               row.Info,
                               row.Duration/2))
        
        # Create / overwrite the existing DataFrame
        data = pd.DataFrame(tuples,
                            columns=['Start', 'End', 'Info', 'Duration'])
        
    return data

split_data(data)

                     Start                 End          Info  Duration
0      2023-04-01 00:00:00 2023-04-01 00:15:00       column0      15.0
1      2023-04-01 00:15:00 2023-04-01 00:30:00       column0      15.0
2      2023-04-01 00:30:00 2023-04-01 00:45:00       column0      15.0
3      2023-04-01 00:45:00 2023-04-01 01:00:00       column0      15.0
4      2023-04-01 01:00:00 2023-04-01 01:15:00       column1      15.0
...                    ...                 ...           ...       ...
959995 2050-08-16 22:45:00 2050-08-16 23:00:00  column239998      15.0
959996 2050-08-16 23:00:00 2050-08-16 23:15:00  column239999      15.0
959997 2050-08-16 23:15:00 2050-08-16 23:30:00  column239999      15.0
959998 2050-08-16 23:30:00 2050-08-16 23:45:00  column239999      15.0
959999 2050-08-16 23:45:00 2050-08-17 00:00:00  column239999      15.0


%timeit split_data(data) 
14.5 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

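Is there a faster way to do the same task? I always try to use pandas vectorization, but in this case I have no idea how to apply it.

Thank you in advance!

Best regards,

Michael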


python pandas performance for-loop vectorization
1 Answer

Assuming the final duration/frequency is a divisor of the original duration, you can use repeat and some arithmetic:

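import numpy as np
import pandas as pd

FREQ = 15

# number of pieces per row; repeat needs integer counts
n = np.ceil(data['Duration'].div(FREQ)).astype(int)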

out = (data
 .loc[data.index.repeat(n)]
 .assign(Duration=lambda d: d['Duration'].div(n))
)

g = out.groupby(level=0)
out['Start'] += pd.to_timedelta(g['Duration'].cumsum().sub(FREQ), unit='min')
out['End'] -= pd.to_timedelta(g['Duration'].cumcount(ascending=False).mul(FREQ), unit='min')

Output:

                     Start                 End          Info  Duration
0      2023-04-01 00:00:00 2023-04-01 00:15:00       column0      15.0
0      2023-04-01 00:15:00 2023-04-01 00:30:00       column0      15.0
0      2023-04-01 00:30:00 2023-04-01 00:45:00       column0      15.0
0      2023-04-01 00:45:00 2023-04-01 01:00:00       column0      15.0
1      2023-04-01 01:00:00 2023-04-01 01:15:00       column1      15.0
...                    ...                 ...           ...       ...
239998 2050-08-16 22:45:00 2050-08-16 23:00:00  column239998      15.0
239999 2050-08-16 23:00:00 2050-08-16 23:15:00  column239999      15.0
239999 2050-08-16 23:15:00 2050-08-16 23:30:00  column239999      15.0
239999 2050-08-16 23:30:00 2050-08-16 23:45:00  column239999      15.0
239999 2050-08-16 23:45:00 2050-08-17 00:00:00  column239999      15.0

[960000 rows x 4 columns]
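
If the original duration is not an exact multiple of the target frequency, the same repeat idea can still be used by computing each piece's position within its original row and clipping the last piece at the original End. The following is only a minimal sketch of that variation, not the code from the answer above: the function name split_rows, the freq parameter and the small example frame are hypothetical, and the columns are assumed to be named Start, End, Info and Duration as in the question.

```python
import datetime as dt

import numpy as np
import pandas as pd


def split_rows(data, freq=15):
    """Split each row into consecutive pieces of at most `freq` minutes.

    Hypothetical helper: assumes columns Start, End, Info and Duration
    (in minutes), as in the question.
    """
    # Number of pieces per row; at least 1 so that short rows are kept.
    n = np.ceil(data["Duration"].to_numpy() / freq).astype(int).clip(min=1)

    # Repeat every row n times by position and build a fresh 0..N-1 index.
    out = data.iloc[np.repeat(np.arange(len(data)), n)].reset_index(drop=True)

    # Position of each piece inside its original row: 0, 1, ..., n_i - 1.
    offsets = np.cumsum(n) - n
    k = np.arange(n.sum()) - np.repeat(offsets, n)

    # Shift the start by k * freq minutes and clip the end at the original End.
    start = out["Start"] + pd.to_timedelta(k * freq, unit="min")
    end = start + pd.Timedelta(minutes=freq)
    end = end.where(end < out["End"], out["End"])

    out["Start"] = start
    out["End"] = end
    out["Duration"] = (end - start).dt.total_seconds() / 60
    return out


# Small example: one 60-minute row and one 40-minute row.
small = pd.DataFrame({
    "Start": [dt.datetime(2023, 4, 1, 0), dt.datetime(2023, 4, 1, 1)],
    "End": [dt.datetime(2023, 4, 1, 1), dt.datetime(2023, 4, 1, 1, 40)],
    "Info": ["column0", "column1"],
})
small["Duration"] = (small["End"] - small["Start"]).dt.total_seconds() / 60

result = split_rows(small)
print(result)
```

With this sketch the 40-minute row is split into two 15-minute pieces and a final 10-minute piece. When every duration is a multiple of the frequency, it should produce the same rows as the approach above, only with a fresh 0..N-1 index instead of the repeated original labels.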