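I have a pandas DataFrame containing the start and finish datetimes of multiple events. For every datetime (to the nearest minute), I want to count how many events it falls within, i.e. how many times it lies between an event's start and finish. I build the counts in a dictionary, then create a DataFrame, strip the date component, and sum the counts by time of day.
I have the following code, which works, but with a larger DataFrame (1000 rows rather than 100) the processing time is too slow to be useful. I'm usually a fan of vectorization, but I can't work out how to apply it here.
Working code (but slow):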
# Creating the dataframe
import pandas as pd
from datetime import date
data = {"Date":[date(2024,1,5),date(2024,1,6),date(2024,1,7)],"Start":[pd.Timestamp("2024-01-05 10:05"),pd.Timestamp("2024-01-06 09:05"),pd.Timestamp("2024-01-07 11:12")],"Finish":[pd.Timestamp("2024-01-05 10:35"),pd.Timestamp("2024-01-06 09:55"),pd.Timestamp("2024-01-07 11:58")]}
df = pd.DataFrame(data)
#Creating the Date range
dates = pd.date_range(start=df["Date"].min(), end=df["Date"].max(),freq="1min")
# Dictionary of dates to store the cumulative score
d_data = {}
for x in dates:
    d_data[x] = 0
The next piece of code, which iterates row by row, is the part I want to speed up.
# Iterating over each row of the df and over every stored date
for index, row in df.iterrows():
    for d in dates:
        # Count the minute if it lies within the event (inclusive at both ends)
        if (d >= row["Start"]) and (d <= row["Finish"]):
            d_data[d] += 1
Then I create the DataFrame from the dictionary:
df_data = pd.DataFrame(index=d_data.keys(),data=d_data.values(),columns=["Count"])
df_data.reset_index(names="Date",inplace=True)
# Removing the date component to leave just the time, then summing each occurrence
df_data["Date"] = df_data["Date"].dt.time
df_data["Date"] = df_data["Date"].astype("str")
df_data["Date"] = df_data["Date"].str[:-3]
df_data = df_data.groupby("Date").sum().reset_index()
Output:
| Date | Count
0 | 00:00 | 0
1 | 00:01 | 0
545| 09:05 | 1
546| 09:06 | 1
# count the number of occurrences for each minute
date_ranges_list = []
for i, row in df.iterrows():
    date_range = pd.date_range(start=row["Start"], end=row["Finish"], freq="1min")
    date_ranges_list.append(pd.Series(date_range))
all_date_ranges = pd.concat(date_ranges_list)
minute_counts = all_date_ranges.value_counts().sort_index()
# if you want, you can add rows with zero
all_dates = pd.Series(
    0, index=pd.date_range(start=df["Date"].min(), end=df["Date"].max(), freq="1min")
)
minute_counts = all_dates.add(minute_counts, fill_value=0)
# convert to dataframe
out = minute_counts.to_frame(name="Count")
Result:
Count
2024-01-05 00:00:00 0.0
2024-01-05 00:01:00 0.0
2024-01-05 00:02:00 0.0
2024-01-05 00:03:00 0.0
2024-01-05 00:04:00 0.0
... ...
2024-01-07 11:54:00 1.0
2024-01-07 11:55:00 1.0
2024-01-07 11:56:00 1.0
2024-01-07 11:57:00 1.0
2024-01-07 11:58:00 1.0
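If the per-row `pd.date_range` calls are still too slow on larger frames, the loop can be avoided altogether. The following is a minimal sketch of a different technique, a difference array followed by a cumulative sum: add +1 at the minute each event starts, -1 at the minute after it finishes, and take the running total. `count_active_per_minute` is a hypothetical helper name, not something from pandas. The sketch assumes `Start` and `Finish` are datetime columns with no missing values and `Start <= Finish`; it rounds `Start` up and `Finish` down to whole minutes to mirror the `>=` / `<=` comparisons in the question, and it starts the grid at midnight of the first start day.

```python
import numpy as np
import pandas as pd

# Same sample events as in the question
df = pd.DataFrame({
    "Start": pd.to_datetime(["2024-01-05 10:05", "2024-01-06 09:05", "2024-01-07 11:12"]),
    "Finish": pd.to_datetime(["2024-01-05 10:35", "2024-01-06 09:55", "2024-01-07 11:58"]),
})


def count_active_per_minute(events):
    """Count, for every minute on a grid, how many events cover that minute."""
    # Assumption: a minute counts if Start <= minute <= Finish, as in the question
    start = events["Start"].dt.ceil("min")
    finish = events["Finish"].dt.floor("min")

    # One-minute grid from midnight of the first start day to the last finish
    grid = pd.date_range(start.min().normalize(), finish.max(), freq="1min")

    # Integer positions of each event's first minute and of the minute after its last
    one_min = pd.Timedelta(minutes=1)
    lo = ((start - grid[0]) // one_min).to_numpy()
    hi = ((finish - grid[0]) // one_min).to_numpy() + 1

    # Difference array: +1 where an event begins, -1 just after it ends
    delta = np.zeros(len(grid) + 1, dtype=np.int64)
    np.add.at(delta, lo, 1)   # np.add.at accumulates correctly on repeated positions
    np.add.at(delta, hi, -1)

    # The running total is the number of events active at each minute
    return pd.Series(np.cumsum(delta[:-1]), index=grid, name="Count")


counts = count_active_per_minute(df)

# Drop the date component and sum by time of day, as the question asks
by_time = counts.groupby(counts.index.strftime("%H:%M")).sum().rename_axis("Time")
```

The work here grows with the number of events plus the number of grid minutes, rather than with their product as in the nested loop, and the counts stay integers instead of becoming floats through `fill_value`.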