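I know there are many answers and methods for getting a week-over-week (WoW) difference in Python or a pandas DataFrame, but I have found that many of them do not work for the edge cases in my data. I have a dataframe containing hourly transaction timestamps and hourly sales, and I want to calculate the WoW difference for each hour and weekday. I have done this before in SQL with a simple self join, but I need to do it in Python. The data may be missing some hours, so approaches like shift will not be reliable.
Snapshot of the dataframe: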
transaction_time | total_sale
2024-01-05 00:00:00 | 3500
2024-01-05 02:00:00 | 4000
.
.
.
2024-01-12 00:00:00 | 3400
2024-01-12 01:00:00 | 3200
2024-01-12 02:00:00 | 4100
.
.
.
2024-01-19 00:00:00 | 5000
2024-01-19 01:00:00 | 4200
In SQL, I would do something like this:
with
base_tbl AS
(
    select
        transaction_time,
        total_sale
    from table
    where transaction_time between <desired timeframe>
)
select
    t.transaction_time
    , SAFE_SUBTRACT(SAFE_DIVIDE(t.total_sale, c.total_sale), 1) AS wow_percent_change
FROM
    base_tbl AS t
FULL OUTER JOIN
    base_tbl AS c
    ON EXTRACT(HOUR FROM t.transaction_time) = EXTRACT(HOUR FROM c.transaction_time)
    AND TIMESTAMP_DIFF(t.transaction_time, c.transaction_time, DAY) = 7
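I have tried a few approaches that basically look at a fixed interval between rows, such as shift, e.g. df['wow_change'] = df['total_sale'] - df['total_sale'].shift(168), but this is not a reliable approach for my data. Is there a more error-proof way to do this?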
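A minimal sketch of why a fixed row offset misaligns when hours are missing. It uses a synthetic hourly series (the sales values are made up for illustration) with 2024-01-05 01:00:00 removed, mirroring the gap in the snapshot above; the column names paired_time and gap are illustrative:

```python
import pandas as pd

# Synthetic hourly series covering 8 days; values are placeholders.
times = pd.date_range("2024-01-05", periods=8 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"transaction_time": times, "total_sale": range(len(times))})

# Drop one hour to mimic a gap in the data.
missing = pd.Timestamp("2024-01-05 01:00:00")
df = df[df["transaction_time"] != missing].reset_index(drop=True)

# Timestamp of the row that shift(168) pairs each row with.
df["paired_time"] = df["transaction_time"].shift(168)
df["gap"] = df["transaction_time"] - df["paired_time"]

# Rows whose partner is not exactly 7 days earlier are misaligned.
misaligned = df[df["gap"].notna() & (df["gap"] != pd.Timedelta(days=7))]
print(misaligned[["transaction_time", "paired_time", "gap"]])
```

Here 2024-01-12 01:00:00 is paired with 2024-01-05 00:00:00, while 2024-01-12 00:00:00 gets no partner at all even though its match exists.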
My desired output looks like this:
transaction_time | wow_change
2024-01-05 00:00:00 | nan
2024-01-05 02:00:00 | nan
.
.
.
2024-01-12 00:00:00 | -100
2024-01-12 01:00:00 | nan
2024-01-12 02:00:00 | 100
.
.
.
2024-01-19 00:00:00 | 1600
2024-01-19 01:00:00 | 1000
You can use the pandas merge function to join on the dates and calculate this.
import pandas as pd
from datetime import timedelta

def calculate_wow(df):
    # Assuming the dataframe has an hour column holding the timestamps
    df['hour'] = pd.to_datetime(df['hour'])
    new_df = df.copy()
    # Move each timestamp forward 7 days so it lines up with the following week's row
    new_df['7days'] = new_df['hour'] + timedelta(days=7)
    # Left join keeps every row; rows with no match 7 days earlier get NaN
    final_df = df.merge(new_df, left_on='hour', right_on='7days', how='left')
    final_df['WoW'] = final_df['total_sale_x'] - final_df['total_sale_y']
    final_df = final_df[['hour_x', 'WoW']]
    return final_df
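As a sketch of how the same join might look with the column names from the question (transaction_time and total_sale), assuming the timestamps are unique and fall exactly on the hour. The names prev, prev_sale and out are illustrative, and the rows are the ones shown in the question's snapshot:

```python
import pandas as pd

# Sample rows copied from the question's snapshot.
df = pd.DataFrame({
    "transaction_time": pd.to_datetime([
        "2024-01-05 00:00:00", "2024-01-05 02:00:00",
        "2024-01-12 00:00:00", "2024-01-12 01:00:00", "2024-01-12 02:00:00",
        "2024-01-19 00:00:00", "2024-01-19 01:00:00",
    ]),
    "total_sale": [3500, 4000, 3400, 3200, 4100, 5000, 4200],
})

# Previous-week lookup: move each timestamp forward 7 days so it lines up
# with the row it should be compared against.
prev = df.rename(columns={"total_sale": "prev_sale"}).copy()
prev["transaction_time"] = prev["transaction_time"] + pd.Timedelta(days=7)

# Left join keeps every current row; rows with no match 7 days earlier get NaN.
out = df.merge(prev, on="transaction_time", how="left")
out["wow_change"] = out["total_sale"] - out["prev_sale"]
result = out[["transaction_time", "wow_change"]]
print(result)
```

Because the join is on the timestamp itself and not on row position, missing hours only produce NaN for the affected rows and do not shift the comparison for the rows that follow.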