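I need an output that shows the total hours worked in a rolling 24-hour window. The data is currently stored so that each row is one hourly slot per person (for example, 7-8am on Jan 2nd), and how much they worked in that hour is stored as "hours". What I need to create is another field that is the sum of the most recent 24 hourly slots (inclusive) for each row. So for the 7-8am example above, I would want the sum of "hours" across 24 rows: Jan 1st 8-9am, Jan 1st 9-10am, ..., Jan 2nd 6-7am, Jan 2nd 7-8am.
Rinse and repeat for every hourly slot.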
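To pin down the window definition, here is a minimal brute-force sketch. It is written in Python rather than SAS purely as a language-neutral illustration; the function name `rolling_24h_brute_force`, the year 2023, and the sample data are assumptions made for the example, not part of the real table.

```python
from datetime import datetime, timedelta

def rolling_24h_brute_force(rows):
    """rows: list of (slot_start, hours) tuples for ONE person.

    For each row, sum the hours of every slot whose start lies in the
    half-open interval (slot_start - 24h, slot_start], i.e. the current
    slot plus the 23 slots before it.
    """
    window = timedelta(hours=24)
    totals = []
    for start_a, _ in rows:
        total = sum(h for start_b, h in rows
                    if start_a - window < start_b <= start_a)
        totals.append(total)
    return totals

# Hypothetical sample: 30 consecutive hourly slots from Jan 1st 8am,
# with one full hour worked in each slot.
base = datetime(2023, 1, 1, 8)
rows = [(base + timedelta(hours=i), 1.0) for i in range(30)]
totals = rolling_24h_brute_force(rows)
```

With this sample, the row at index 23 is the Jan 2nd 7-8am slot and its window covers exactly the 24 slots listed above. This is quadratic per person, so it only serves as a reference for checking faster approaches on small samples.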
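There are 6000 people and we have 6 months of data, which means the table has 6000 * 183 days * 24 hours = about 26.35 million rows.
I currently do this with the code below. It handles a sample of 50 people very easily, but grinds to a halt when I try it on the full table, somewhat understandably.
Does anyone have any other ideas? All date/time variables are in datetime format.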
proc sql;
  create table want as
  select x.*
       , case when Hours_Wrkd_In_Window > 16 then 1 else 0 end as Correct
  from (
    select a.ID
         , a.Start_DTTM
         , a.End_DTTM
         , sum(b.hours) as Hours_Wrkd_In_Window
    from have a
    left join have b
      on  a.ID = b.ID
      and b.start_dttm >  a.start_dttm - (24 * 60 * 60)
      and b.start_dttm <= a.start_dttm
    where datepart(a.Start_dttm) >= &report_start_date.
      and datepart(a.Start_dttm) <  &report_end_date.
    group by a.ID
           , a.Start_DTTM
           , a.End_DTTM
  ) x
  order by x.ID
         , x.Start_DTTM
  ;
quit;
A composite index on the columns accessed in the joined table (id + start_dttm + hours) would be useful, if one does not already exist.
Using msglevel=i will print some diagnostic information about how the query is executed. It may provide some additional hints.
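The best-performing DATA step solution will most likely involve a ring array that tracks the 1-hour time slots and the hours worked. The ring lets you compute rolling aggregates (sum and count) based on what enters and what leaves the ring.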
If you have a broad SAS license, look at the procedures in SAS/ETS (Econometrics and Time Series). Proc EXPAND may have some rolling aggregate capability.
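This sample DATA step code took <10s (WORK folder on an SSD) to run on simulated data for 6k people, with complete coverage of 1-hour time slots over 6 months.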
data have(keep=id start_dt end_dt hours);
  do id = 1 to 6000;
    do start_dt
      = intnx('dtmonth', datetime(), -12)
      to intnx('dtmonth', datetime(), -6)
      by dhms(0,1,0,0)
    ;
      end_dt = start_dt + dhms(0,1,0,0);
      hours = 0.25 * floor (5 * ranuni(123)); * 0, 1/4, 1/2, 3/4 or 1 hour;
      output;
    end;
  end;
  format hours 5.2;
run;
/* %let log= ; options obs=50 linesize=200; * submit this (instead of next) if you want to log the logic; */
%let log=*; options obs=max;
data want2(keep=id start_dt end_dt hours hours_rolling_sum hours_rolling_cnt hours_out_:);
  array dt_ring(24) _temporary_;
  array hr_ring(24) _temporary_;
  call missing (of dt_ring(*));
  call missing (of hr_ring(*));
  if 0 then set have; * prep pdv column order;
  hours_rolling_sum = 0;
  hours_rolling_cnt = 0;
  label hours_rolling_sum = 'Hours worked in prior 24 hours';
  index = 0;
  do until (last.id);
    set have;
    by id start_dt;
    index + 1;
    if index > 24 then index = 1;
    hours_out_sum = 0;
    hours_out_cnt = 0;
    do clear = 1 by 1 until (clear=0);
      if sum (dt_ring(index), 0) = 0 then do;
        * index is first go through ring array, or hit a zeroed slot;
        &log putlog 'NOTE: ' index= 'clear for empty ring item. ';
        clear = 0;
      end;
      else
      if start_dt - dt_ring(index) >= %sysfunc(dhms(0,24,0,0)) then do;
        &log putlog / 'NOTE: ' index= 'reducing and zeroing.' /;
        hours_out_sum + hr_ring(index);
        hours_out_cnt + 1;
        hours_rolling_sum = hours_rolling_sum - hr_ring(index);
        hours_rolling_cnt = hours_rolling_cnt - 1;
        dt_ring(index) = 0;
        hr_ring(index) = 0;
        * advance item to next item, that might also be more than 24 hours ago;
        index = index + 1;
        if index > 24 then index = 1;
      end;
      else do;
        &log putlog / 'NOTE: ' index= 'back off !' /;
        * index was advanced to an item within 24 hours, back off one;
        index = index - 1;
        if index < 1 then index = 24;
        clear = 0;
      end;
    end; /* do clear */
    dt_ring(index) = start_dt;
    hr_ring(index) = hours;
    hours_rolling_sum + hours;
    hours_rolling_cnt + 1;
    &log putlog 'NOTE: ' index= 'overlaying and aggregating.' / 'NOTE: ' start_dt= hours= hours_rolling_sum= hours_rolling_cnt=;
    output;
  end; /* do until */
  format hours_rolling_sum 5.2 hours_rolling_cnt 2.;
  format hours_out_sum 5.2 hours_out_cnt 2.;
run;
options obs=max;
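When reviewing the results, you should notice that the change in hours_rolling_sum from one row to the next is +(hours in the slot) - (hours_out_sum, i.e. the hours removed from the ring).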
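The same bookkeeping can be sketched outside SAS. The following is a minimal Python illustration of the idea, not a port of the DATA step: it uses a `collections.deque` in place of the fixed 24-element ring array, and the function name `rolling_sums` and the sample data are assumptions made for the example.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60

def rolling_sums(slots):
    """slots: (start_seconds, hours) tuples for one person, sorted by start.

    Yields (start, hours, hours_out_sum, hours_rolling_sum) per slot.
    """
    ring = deque()      # slots currently inside the 24-hour window
    rolling = 0.0
    for start, hours in slots:
        out = 0.0
        # Evict every slot that started 24 hours or more before this one.
        while ring and start - ring[0][0] >= WINDOW_SECONDS:
            _, old_hours = ring.popleft()
            out += old_hours
        rolling -= out
        ring.append((start, hours))
        rolling += hours
        yield start, hours, out, rolling

# Hypothetical sample: 60 hourly slots with 0, 1/4, 1/2, 3/4 or 1 hour worked.
slots = [(i * 3600, 0.25 * (i % 5)) for i in range(60)]
results = list(rolling_sums(slots))

# Check the delta relation: new sum = previous sum + hours - hours_out_sum.
previous = 0.0
for start, hours, out, rolling in results:
    assert rolling == previous + hours - out
    previous = rolling
```

Each slot enters and leaves the ring exactly once, so the work per person is linear in the number of rows, which is the reason this approach scales where the self-join does not.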
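If you must use SQL, I would suggest following @jspascal's advice and indexing the table, but rearranging the query so that the original data is left joined to an inner-joined subselect (so that SQL performs an index-assisted hash join on the ids). For the same small number of people it should be faster than the original query, but it will still be too slow to process all 6K.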
proc sql;
  create index id on have (id);
  create index id_slot on have (id, start_dt);
quit;
proc sql _method;
  reset inobs=50; * limit data so you can see the _method;
  create table want as
  select
    have.*
    , case
        when ROLLING.HOURS_WORKED_24_HOUR_PRIOR > 16
        then 1
        else 0
      end as REVIEW_TIME_CLOCKING_FLAG
  from
    have
  left join
    (
      select
        EACH_SLOT.id
        , EACH_SLOT.start_dt
        , count(*) as SLOT_COUNT_24_HOUR_PRIOR
        , sum(PRIOR_SLOT.hours) as HOURS_WORKED_24_HOUR_PRIOR
      from
        have as EACH_SLOT
      join
        have as PRIOR_SLOT
      on
        EACH_SLOT.ID = PRIOR_SLOT.ID
        and EACH_SLOT.start_dt - PRIOR_SLOT.start_dt between 0 and %sysfunc(dhms(0,24,0,0))-0.1
      group by
        EACH_SLOT.id, EACH_SLOT.start_dt
    ) as ROLLING
  on
    have.ID = ROLLING.ID
    and have.start_dt = ROLLING.start_dt
  order by
    id, start_dt
  ;
  %put NOTE: SQLOOPS = &SQLOOPS;
quit;
The inner join is pyramid-like and still involves a lot of internal looping.
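To get a rough feel for how much work the self-join implies, the number of matched row pairs can be counted directly. This is a back-of-the-envelope Python sketch that assumes complete hourly coverage for every person (6000 people, 183 days); the helper name `joined_pairs_per_person` is made up for the example, and the count says nothing about how PROC SQL actually schedules the join.

```python
def joined_pairs_per_person(n_slots, window=24):
    """Number of (current slot, prior slot) pairs for one person.

    The k-th slot in time order matches itself plus up to window - 1
    earlier slots, so it contributes min(k, window) pairs.
    """
    return sum(min(k, window) for k in range(1, n_slots + 1))

people = 6000
slots_per_person = 183 * 24          # six months of hourly slots

rows = people * slots_per_person
pairs = people * joined_pairs_per_person(slots_per_person)

print(f"rows in table : {rows:,}")
print(f"joined pairs  : {pairs:,}")
print(f"pairs per row : {pairs / rows:.2f}")
```

Under these assumptions the join has to produce and aggregate roughly 24 matched pairs per input row, whereas the ring approach touches each row a constant number of times.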