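I have a DataFrame (called df) that looks like this:
I'm trying to take every weekend 'Volume' value (rows where the 'WEEKDAY' column is 5 (Saturday) or 6 (Sunday)) and add it to the following Monday (WEEKDAY = 0).
I've tried a few things, but nothing worked. Taking the last three rows as an example:
What I expect is:
To reproduce the problem: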
!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv
import pandas as pd
df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df = df.set_index('Datas')
df['WEEKDAY'] = df.index.dayofweek
df
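This solves your problem using pandas' shift. It relies on the rows being sorted by date with no missing days, so that the two rows before each Monday are Saturday and Sunday.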
import pandas as pd
df['prior_volume'] = df.Volume.shift(1)
df['prior_volume2'] = df.Volume.shift(2)
df.loc[df['WEEKDAY'] == 0, 'Volume'] = df.loc[df['WEEKDAY'] == 0, 'prior_volume'] + \
df.loc[df['WEEKDAY'] == 0, 'prior_volume2'] + \
df.loc[df['WEEKDAY'] == 0, 'Volume']
df = df[df['WEEKDAY'].isin(range(5))]
df = df[['Volume', 'WEEKDAY']]
df.head(10)
I used .groupby to solve the problem.
import pandas as pd
df = pd.read_csv('volume_por_dia.csv')
df['Datas'] = pd.to_datetime(df['Datas'])
df['WEEKDAY'] = df['Datas'].dt.dayofweek
df['index'] = df['Datas']
# Group df by date, setting frequency as week
#(beginning Tue - so that Sat and Sun will be added to the next Mon)
df_group = df.groupby([pd.Grouper(key = 'Datas', freq='W-MON'), \
'WEEKDAY', 'index']).agg({'Volume': 'sum'})
# In each group, add days 5, 6 (Sat and Sun) to day 0 (Mon)
df_group.loc[(slice(None), 0), 'Volume'] += \
df_group.loc[(slice(None), [5, 6]), 'Volume'].groupby(level=0).sum()
# In the grouped data, remove Sat and Sun
df_group = df_group.reset_index()
df_group = df_group[df_group['WEEKDAY'] != 5]
df_group = df_group[df_group['WEEKDAY'] != 6]
# Remove volume data from original df, and merge with volume from df_group
df = df.drop(['Volume'], axis=1)
df = pd.merge(df,df_group[['index','Volume']],on='index', how='left')
df = df.dropna(subset=['Volume'])
df = df.drop(['index'], axis=1)
# Optional: sort dates in ascending order
df = df.sort_values(by=['Datas'])
print (df)
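To make the `W-MON` grouping used above easier to follow, here is a minimal sketch on a made-up one-week frame (the `toy` data and the `weekly` name are illustrative only, not part of the original CSV). With `freq='W-MON'` each bin ends on a Monday, so Tuesday through the next Monday fall into the same group, which is labelled by that Monday.

```python
import pandas as pd

# Hypothetical toy data: Tuesday 2022-03-08 through Monday 2022-03-14
toy = pd.DataFrame({
    "Datas": pd.date_range("2022-03-08", "2022-03-14", freq="D"),
    "Volume": [18, 32, 24, 18, 4, 1, 10],
})

# Weekly bins anchored on Monday: each bin runs Tuesday..Monday
weekly = toy.groupby(pd.Grouper(key="Datas", freq="W-MON"))["Volume"].sum()
print(weekly)
```

All seven days land in a single bin labelled 2022-03-14, which is why Saturday and Sunday end up in the same group as the Monday that follows them.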
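You can simply iterate over the rows, accumulating the volume starting from Saturday, and write the accumulated total into the Monday row. Then drop the Saturday and Sunday rows.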
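values = df.values
volume_col = df.columns.get_loc("Volume")
volume_accumulated = 0
for idx, row in enumerate(values):
    if row[1] in (5, 6):
        # Saturday or Sunday: keep accumulating
        volume_accumulated += row[0]
    elif row[1] == 0:
        # Monday: add the weekend total and write it back by position
        volume_accumulated += row[0]
        df.iloc[idx, volume_col] = volume_accumulated
    else:
        volume_accumulated = 0
# Drop the Saturday and Sunday rows
df = df[~df["WEEKDAY"].isin([5, 6])]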
Input:
!wget https://raw.githubusercontent.com/brunodifranco/TCC/main/volume_por_dia.csv
import pandas as pd
import numpy as np
df = pd.read_csv('volume_por_dia.csv').sort_values('Datas',ascending=True)
df['Datas'] = pd.to_datetime(df['Datas'])
df.set_index('Datas', inplace=True)
df['WEEKDAY'] = df.index.dayofweek
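I'm assuming that the index dates are sorted, that the Datas index is unique, and that there are no missing days. Some assumptions I can't make, however, are:
For these reasons, before computing the weekend volumes, I first extract the dates of the first Saturday and the last Monday: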
first_saturday = df.index[df.WEEKDAY==5][0]
last_monday = df.index[df.WEEKDAY==0][-1]
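Now I can extract the weekend volumes, making sure I always have a Saturday-Sunday pair and that, for each pair, the following Monday exists in the DataFrame: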
df_weekend = df.loc[
(df.WEEKDAY.isin([5,6]))&
(df.index<=last_monday)&
(df.index>=first_saturday)
]
df_weekend
Now, since I have pairs of Saturday-Sunday volumes, I can compute the sums as follows:
weekend_volumes = pd.Series(
df_weekend.Volume.values.reshape(-1,2).sum(axis=1), #sum of volume couples
index = df_weekend.index[1::2]+pd.Timedelta("1d"), #date of the following monday
name="weekend_volume"
).reindex(df.index).fillna(0) #zero weekend-volume for days that are not mondays
weekend_volumes
Finally, add the weekend volumes to the original volumes:
df["Volume"] = df.Volume+weekend_volumes
Below are the last 25 rows of df:
# 2022-02-18 16.0 4
# 2022-02-19 2.0 5
# 2022-02-20 1.0 6
# 2022-02-21 10.0 0
# 2022-02-22 43.0 1
# 2022-02-23 36.0 2
# 2022-02-24 38.0 3
# 2022-02-25 28.0 4
# 2022-02-26 5.0 5
# 2022-02-27 3.0 6
# 2022-02-28 14.0 0
# 2022-03-01 10.0 1
# 2022-03-02 16.0 2
# 2022-03-03 18.0 3
# 2022-03-04 11.0 4
# 2022-03-05 8.0 5
# 2022-03-06 2.0 6
# 2022-03-07 32.0 0
# 2022-03-08 18.0 1
# 2022-03-09 32.0 2
# 2022-03-10 24.0 3
# 2022-03-11 18.0 4
# 2022-03-12 4.0 5
# 2022-03-13 1.0 6
# 2022-03-14 10.0 0
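Adding 2 solutions here:
Using pd.shift (pointed out earlier by Lukas Hestermeyer; I've added a simplified version)
Using a rolling window (this is effectively a one-liner)
Both solutions assume that the dates (the Datas column) are sorted in ascending order (if not, sort them before proceeding).
Part 1 | Data preparation: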
import pandas as pd
import numpy as np
# STEP 1: Create DF
Datas = [
'2019-07-02',
'2019-07-03',
'2019-07-04',
'2019-07-05',
'2019-07-06',
'2019-07-07',
'2019-07-08',
'2022-03-10',
'2022-03-11',
'2022-03-12',
'2022-03-13',
'2022-03-14'
]
Volume = [17, 30, 20, 21, 5, 10, 12, 24, 18, 4, 1, 5]
WEEKDAY = [1, 2, 3, 4, 5, 6, 0, 3, 4, 5, 6, 0]
dic = {'Datas': Datas, 'Volume': Volume, 'WEEKDAY': WEEKDAY}
df = pd.DataFrame(dic)
Part 2 | Solutions:
Solution 1 [pd.shift]:
# STEP 1: add shifts
df['shift_1'] = df['Volume'].shift(1)
df['shift_2'] = df['shift_1'].shift(1)
# STEP 2: sum Volume with shifts where weekday==0
cols_to_sum = ['Volume', 'shift_1', 'shift_2']
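# Access row values by label; positional x[0] on a label-indexed row is deprecated in recent pandas
df['Volume'] = df[['WEEKDAY'] + cols_to_sum].apply(lambda x: int(x['Volume']) if x['WEEKDAY'] else int(x['Volume'] + x['shift_1'] + x['shift_2']), axis=1)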
df = df.drop(['shift_1', 'shift_2'], axis=1)
df
Solution 2 [rolling window]:
# use rolling window of size 3 to sum where weekday == 0
df['Volume'] = np.where(
df['WEEKDAY'] == 0,
df['Volume'].rolling(window=3, center=False).sum(),
df['Volume']
)
df
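Part 3 | Remove weekend records:
# Assign the result back; .loc alone returns a filtered copy and leaves df unchanged
df = df.loc[~df['WEEKDAY'].isin([5, 6])]
df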
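Both the shift and the rolling-window approach only pick up Saturday and Sunday if the two rows before each Monday really are those days, so gaps in the dates would silently add the wrong rows. As a rough sketch, one could fill the gaps first. The helper name `fill_missing_days` is hypothetical, and treating a missing day as zero volume is an assumption that may not suit every dataset; the date column is also assumed to contain no duplicates.

```python
import pandas as pd

def fill_missing_days(df, date_col="Datas"):
    """Hypothetical helper: reindex to consecutive calendar days."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out = out.set_index(date_col).sort_index()
    full_range = pd.date_range(out.index.min(), out.index.max(), freq="D")
    out = out.reindex(full_range)
    out["Volume"] = out["Volume"].fillna(0)  # assumption: missing day = no volume
    out["WEEKDAY"] = out.index.dayofweek
    out.index.name = date_col
    return out

# Made-up example: Sunday 2022-03-13 is missing from the data
toy = pd.DataFrame({"Datas": ["2022-03-11", "2022-03-12", "2022-03-14"],
                    "Volume": [18, 4, 10]})
filled = fill_missing_days(toy)
rolled = filled["Volume"].rolling(window=3).sum()
print(filled)
print(rolled)
```

After filling, the window ending on Monday 2022-03-14 covers Saturday, the inserted Sunday, and Monday, rather than reaching back to Friday.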
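If you consider the week as starting on Tuesday, the problem becomes simpler. You just take the weekend values and add them to the Monday of that week (i.e., the Monday after the weekend). This automatically handles the cases where your data starts or ends on a weekend.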
import numpy as np
import pandas as pd
np.random.seed(1)
# Sample data
dates = pd.date_range('2018-02-05', '2018-07-22', freq='D')
volume = np.random.randint(1, 50, len(dates))
df = pd.DataFrame(dict(Datas=dates, Volume=volume))
df = df.set_index('Datas')
# Week starting from Tuesday
week = ((df.index - pd.DateOffset(days=1)).isocalendar().week).values
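def add_weekend_to_monday(week):
    week = week.copy()  # avoid modifying the group in place
    monday = week.index.weekday == 0
    weekend = week.index.weekday >= 5
    week[monday] += week[weekend].sum()
    return week
# group_keys=False keeps the original date index so the result aligns with df
df['Volume'] = df.groupby(week, group_keys=False)['Volume'].apply(add_weekend_to_monday)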
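One caveat with grouping on the ISO week number alone is that the same number recurs every year, so data spanning several years could mix unrelated weeks. A possible variation on the same idea, sketched here under the assumption of a DatetimeIndex and a `Volume` column, maps each Saturday and Sunday to the date of the following Monday and then sums per date. The function name `fold_weekends_into_monday` is made up for illustration.

```python
import numpy as np
import pandas as pd

def fold_weekends_into_monday(df):
    """Hypothetical helper: move Sat/Sun Volume onto the following Monday."""
    weekday = df.index.dayofweek.to_numpy()
    # Days until the next Monday: 2 for Saturday, 1 for Sunday, 0 otherwise
    days_ahead = np.where(weekday == 5, 2, np.where(weekday == 6, 1, 0))
    target = df.index + pd.to_timedelta(days_ahead, unit="D")
    out = df["Volume"].groupby(target).sum().to_frame("Volume")
    out["WEEKDAY"] = out.index.dayofweek
    out.index.name = df.index.name
    return out

# Made-up example: Friday 2022-03-11 through Monday 2022-03-14
toy = pd.DataFrame(
    {"Volume": [18, 4, 1, 10]},
    index=pd.date_range("2022-03-11", "2022-03-14", freq="D", name="Datas"),
)
result = fold_weekends_into_monday(toy)
print(result)  # Monday 2022-03-14 becomes 10 + 4 + 1 = 15
```

Note that if a weekend has no Monday row after it (for example, data ending on a Sunday), this sketch creates that Monday row rather than dropping the weekend volume, which may or may not be what you want.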