I'm looking for a faster way to edit the rows of a pandas DataFrame (splitting rows, adding rows).
Here is the code:
import datetime as dt
import pandas as pd

# Definition of the start
start_0 = dt.datetime(2023, 4, 1, 0, 0, 0)
# Create a DataFrame with columns Start, End and Info
# Calculate the duration between End and Start (in minutes)
data = (pd.DataFrame({"Start": [(start_0 + dt.timedelta(hours=i)) for i in range(0, 24*10_000)],
                      "End": [(start_0 + dt.timedelta(hours=i+1)) for i in range(0, 24*10_000)],
                      "Info": [f"column{i}" for i in range(0, 24*10_000)]
                      })
        .assign(Duration=lambda df_:
                ((df_.End - df_.Start).dt.total_seconds())/60
                )
        )
data
                      Start                  End          Info  Duration
0       2023-04-01 00:00:00  2023-04-01 01:00:00       column0      60.0
1       2023-04-01 01:00:00  2023-04-01 02:00:00       column1      60.0
2       2023-04-01 02:00:00  2023-04-01 03:00:00       column2      60.0
3       2023-04-01 03:00:00  2023-04-01 04:00:00       column3      60.0
4       2023-04-01 04:00:00  2023-04-01 05:00:00       column4      60.0
...                     ...                  ...           ...       ...
239995  2050-08-16 19:00:00  2050-08-16 20:00:00  column239995      60.0
239996  2050-08-16 20:00:00  2050-08-16 21:00:00  column239996      60.0
239997  2050-08-16 21:00:00  2050-08-16 22:00:00  column239997      60.0
239998  2050-08-16 22:00:00  2050-08-16 23:00:00  column239998      60.0
239999  2050-08-16 23:00:00  2050-08-17 00:00:00  column239999      60.0
...and here is the "slow" function:
def split_data(data):
    """
    Split the rows of the data until a duration of 15 minutes is reached.

    Parameters:
        data (DataFrame): Data with Start, End and Info
    Returns:
        data (DataFrame): Data with max. duration of 15 min
    """
    # Keep halving until every row lasts 15 minutes
    while data.Duration.mean() != 15:
        # Create an empty list
        tuples = []
        # Loop through the rows
        for row in data.itertuples():
            # Append two rows, each with half of the duration
            # (note: rows already at 15 minutes are not carried over)
            if row.Duration != 15:
                tuples.append((row.Start,
                               row.Start + dt.timedelta(minutes=row.Duration/2),
                               row.Info,
                               row.Duration/2))
                tuples.append((row.Start + dt.timedelta(minutes=row.Duration/2),
                               row.End,
                               row.Info,
                               row.Duration/2))
        # Create / overwrite the existing DataFrame
        data = pd.DataFrame(tuples,
                            columns=['Start', 'End', 'Info', 'Duration'])
    return data

split_data(data)
                      Start                  End          Info  Duration
0       2023-04-01 00:00:00  2023-04-01 00:15:00       column0      15.0
1       2023-04-01 00:15:00  2023-04-01 00:30:00       column0      15.0
2       2023-04-01 00:30:00  2023-04-01 00:45:00       column0      15.0
3       2023-04-01 00:45:00  2023-04-01 01:00:00       column0      15.0
4       2023-04-01 01:00:00  2023-04-01 01:15:00       column1      15.0
...                     ...                  ...           ...       ...
959995  2050-08-16 22:45:00  2050-08-16 23:00:00  column239998      15.0
959996  2050-08-16 23:00:00  2050-08-16 23:15:00  column239999      15.0
959997  2050-08-16 23:15:00  2050-08-16 23:30:00  column239999      15.0
959998  2050-08-16 23:30:00  2050-08-16 23:45:00  column239999      15.0
959999  2050-08-16 23:45:00  2050-08-17 00:00:00  column239999      15.0
%timeit split_data(data)
14.5 s ± 139 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a faster way to do the same task? I always try to use pandas vectorization, but in this case I have no idea how.
Thanks in advance!
Best regards,
Michael
Assuming the final duration/frequency is a divisor of the original duration, you can use repeat and some arithmetic:
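import numpy as np

FREQ = 15  # target duration of each piece, in minutes
# Number of pieces each original row is split into
n = np.ceil(data['Duration'].div(FREQ))
# Repeat each row n times and give every piece its share of the duration
out = (data
       .loc[data.index.repeat(n)]
       .assign(Duration=lambda d: d['Duration'].div(n))
      )
# Group the pieces by the index label of their original row
g = out.groupby(level=0)
# Shift Start forward by the duration accumulated before each piece
out['Start'] += pd.to_timedelta(g['Duration'].cumsum().sub(FREQ), unit='min')
# Pull End back by FREQ minutes for every piece that still follows
out['End'] -= pd.to_timedelta(g['Duration'].cumcount(ascending=False).mul(FREQ), unit='min')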
Output:
                      Start                  End          Info  Duration
0       2023-04-01 00:00:00  2023-04-01 00:15:00       column0      15.0
0       2023-04-01 00:15:00  2023-04-01 00:30:00       column0      15.0
0       2023-04-01 00:30:00  2023-04-01 00:45:00       column0      15.0
0       2023-04-01 00:45:00  2023-04-01 01:00:00       column0      15.0
1       2023-04-01 01:00:00  2023-04-01 01:15:00       column1      15.0
...                     ...                  ...           ...       ...
239998  2050-08-16 22:45:00  2050-08-16 23:00:00  column239998      15.0
239999  2050-08-16 23:00:00  2050-08-16 23:15:00  column239999      15.0
239999  2050-08-16 23:15:00  2050-08-16 23:30:00  column239999      15.0
239999  2050-08-16 23:30:00  2050-08-16 23:45:00  column239999      15.0
239999  2050-08-16 23:45:00  2050-08-17 00:00:00  column239999      15.0
[960000 rows x 4 columns]
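
For illustration, here is a minimal, self-contained sketch of the same repeat idea on a two-row frame with mixed durations. The helper name split_rows, its freq_min parameter and the sample data are assumptions of mine, not part of the question. This variant derives the offsets from cumcount instead of cumsum, and it raises an error when a duration is not a multiple of the target, since the approach relies on that divisor assumption.

```python
import datetime as dt

import numpy as np
import pandas as pd


def split_rows(df: pd.DataFrame, freq_min: int = 15) -> pd.DataFrame:
    """Split each row into consecutive pieces of freq_min minutes (hypothetical helper)."""
    # Number of pieces per row; every Duration must be a multiple of freq_min.
    n = df["Duration"].div(freq_min)
    if not np.allclose(n, n.round()):
        raise ValueError("every Duration must be a multiple of freq_min")
    n = n.round().astype(int)

    # Repeat each row n times; every copy keeps the index label of its original row.
    out = df.loc[df.index.repeat(n.to_numpy())].copy()

    # Position of each copy within its original row: 0, 1, 2, ...
    k = out.groupby(level=0).cumcount()

    # Shift Start by k pieces, then derive End and Duration from the piece length.
    out["Start"] = out["Start"] + pd.to_timedelta(k * freq_min, unit="min")
    out["End"] = out["Start"] + pd.Timedelta(minutes=freq_min)
    out["Duration"] = float(freq_min)
    return out.reset_index(drop=True)


# Small sample (assumed data): one 60-minute row followed by one 30-minute row.
start = dt.datetime(2023, 4, 1)
small = pd.DataFrame({
    "Start": [start, start + dt.timedelta(hours=1)],
    "End": [start + dt.timedelta(hours=1), start + dt.timedelta(minutes=90)],
    "Info": ["column0", "column1"],
    "Duration": [60.0, 30.0],
})
result = split_rows(small)
print(result)
```

On this sample the result has six 15-minute rows (four for column0, two for column1), renumbered from 0 by reset_index(drop=True). Drop that call if you prefer to keep the repeated original labels, as in the output above.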