Is there a faster way to produce the required output than a one-to-many join in Proc SQL?

Question

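I need an output that shows the total hours worked in a rolling 24-hour window. The data is currently stored so that each row is a one-hour slot per person (for example, 7-8am on Jan 2nd), and the amount they worked in that hour is stored as "Hours". What I need to create is another field that, for each row, is the sum of the most recent 24 hourly slots (inclusive). So for the 7-8am example above, I would want the sum of "Hours" across these 24 rows: Jan 1st 8-9am, Jan 1st 9-10am, ..., Jan 2nd 6-7am, Jan 2nd 7-8am.

Rinse and repeat for every hourly slot.

There are 6000 people and we have 6 months of data, which means the table has 6000 * 183 days * 24 hours = 26.3m rows.

I currently do this with the code below. It handles a sample of 50 people easily, but grinds to a halt when I try it on the full table, which is somewhat understandable.

Does anyone have any other ideas? All date/time variables are in datetime format.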

proc sql;
create table want as
 select x.*
 , case when Hours_Wrkd_In_Window > 16 then 1 else 0 end as Correct 
 from (
  select a.ID
  , a.Start_DTTM
  , a.End_DTTM
  , sum(b.hours) as Hours_Wrkd_In_Window
  from have a
   left join have b
   on a.ID = b.ID
   and b.start_dttm > a.start_dttm - (24 * 60 * 60)
   and b.start_dttm <= a.start_dttm
  where datepart(a.Start_dttm) >= &report_start_date.
  and datepart(a.Start_dttm) < &report_end_date.
  group by ID
  , a.Start_DTTM
  , a.End_DTTM  
) x
order by x.ID
, x.Start_DTTM
;quit;

Answer 1

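A composite index on the columns accessed in the joined table (id + start_dttm + hours) would be useful if one does not already exist.

Setting msglevel=i will print some diagnostics about how the query is executed. It may provide some additional hints.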
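
A minimal sketch of both suggestions, assuming the table and column names from the question (`have`, `ID`, `start_dttm`, `hours`). The index name `id_start_hours` is made up for illustration; a composite index name only needs to differ from the column names.

```sas
/* Ask SAS to write informational notes (including index usage) to the log */
options msglevel=i;

proc sql;
  /* Composite index on the columns the join reads from the second copy of HAVE.
     The index name id_start_hours is an arbitrary, hypothetical choice. */
  create index id_start_hours on have (id, start_dttm, hours);
quit;
```

After this, rerunning the original query on a small sample and reading the log should show whether the index was selected.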

Answer 2

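The best performing DATA step solution will most likely involve a ring array that tracks the 1-hour time slots and the hours worked. The ring lets you compute the rolling aggregates (sum and count) from what enters and what leaves the ring.

If you have an extensive SAS license, check out the procedures in SAS/ETS (Econometrics and Time Series). Proc EXPAND might have some rolling aggregate capability.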
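
As a rough sketch of the Proc EXPAND idea: the `MOVSUM` transformation computes a backward moving sum over a fixed number of observations. This assumes SAS/ETS is licensed and, importantly, that every id has exactly one row per hour with no gaps, so that 24 rows equal 24 hours. The output names `want_expand` and `hours_rolling_sum` are illustrative, and the variable names follow the simulated `have` table used in this answer.

```sas
/* Assumption: HAVE has one row per id per hour, with no missing slots */
proc sort data=have;
  by id start_dt;
run;

proc expand data=have out=want_expand method=none;
  by id;                         /* restart the window for each person */
  id start_dt;                   /* time identifier */
  /* sum of the current row and the 23 preceding rows */
  convert hours = hours_rolling_sum / transformout=(movsum 24);
run;
```

If slots can be missing, a row-count window no longer equals a 24-hour window, and the time-aware ring approach is the safer choice.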

This sample DATA step code took <10s (WORK folder on SSD) to run on simulated data for 6k people, each with 6 months of complete 1-hour time slots.

data have(keep=id start_dt end_dt hours);
  do id = 1 to 6000;

    do start_dt 
     = intnx('dtmonth', datetime(), -12)
    to intnx('dtmonth', datetime(), -6)
    by dhms(0,1,0,0)
    ;
      end_dt = start_dt + dhms(0,1,0,0);

      hours = 0.25 * floor (5 * ranuni(123)); * 0, 1/4, 1/2, 3/4 or 1 hour;

      output;
    end;
  end;

  format hours 5.2;
run;

/* %let log= ; options obs=50 linesize=200; * submit this (instead of next) if you want to log the logic; */

%let log=*; options obs=max;

data want2(keep=id start_dt end_dt hours hours_rolling_sum hours_rolling_cnt hours_out_:);

  array dt_ring(24) _temporary_;
  array hr_ring(24) _temporary_;

  call missing (of dt_ring(*));
  call missing (of hr_ring(*));

  if 0 then set have; * prep pdv column order;

  hours_rolling_sum = 0;
  hours_rolling_cnt = 0;
  label hours_rolling_sum = 'Hours worked in prior 24 hours';

  index = 0;

  do until (last.id);
    set have;
    by id start_dt;

    index + 1;
    if index > 24 then index = 1;

    hours_out_sum = 0;
    hours_out_cnt = 0;

    do clear = 1 by 1 until (clear=0);

      if sum (dt_ring(index), 0) = 0 then do;
        * index is first go through ring array, or hit a zeroed slot;

&log putlog 'NOTE: ' index= 'clear for empty ring item. ';

        clear = 0;
      end;
      else
      if start_dt - dt_ring(index) >= %sysfunc(dhms(0,24,0,0)) then do;

&log putlog / 'NOTE: ' index= 'reducting and zeroing.' /;

        hours_out_sum + hr_ring(index);
        hours_out_cnt + 1;

        hours_rolling_sum = hours_rolling_sum - hr_ring(index);
        hours_rolling_cnt = hours_rolling_cnt - 1;
        dt_ring(index) = 0;
        hr_ring(index) = 0;

        * advance item to next item, that might also be more than 24 hours ago;
        index = index + 1;
        if index > 24 then index = 1;

      end;
      else do;

&log putlog / 'NOTE: ' index= 'back off !' /;

        * index was advanced to an item within 24 hours, back off one;
        index = index - 1;
        if index < 1 then index = 24;
        clear = 0;
      end;

    end; /* do clear */

    dt_ring(index) = start_dt;
    hr_ring(index) = hours;

    hours_rolling_sum + hours;
    hours_rolling_cnt + 1;

&log putlog 'NOTE: ' index= 'overlaying and aggregating.' / 'NOTE:  ' start_dt= hours= hours_rolling_sum= hours_rolling_cnt=;

    output;
  end; /* do until */

  format hours_rolling_sum 5.2 hours_rolling_cnt 2.; 
  format hours_out_sum 5.2 hours_out_cnt 2.;
run;

options obs=max;

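When reviewing the results, you should notice that the change in hours_rolling_sum from one row to the next is +(hours in the slot) - (hours_out_sum {hours removed from the ring}).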
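
That relationship can be checked mechanically. The following is a minimal sketch, assuming `want2` was produced by the step above; the data set name `check_delta` and the variables `delta` and `expected` are introduced here for illustration only. It keeps only the rows where the relationship does not hold.

```sas
/* Hypothetical check: keep rows of WANT2 where the change in the rolling sum
   differs from (hours added) - (hours removed from the ring) */
data check_delta;
  set want2;
  by id;

  /* DIF is called on every row so its internal queue stays in step */
  delta = dif(hours_rolling_sum);

  /* the rolling sum restarts from zero for each id */
  if first.id then delta = hours_rolling_sum;

  expected = hours - hours_out_sum;

  /* subsetting IF: output mismatches only */
  if abs(delta - expected) > 1e-9;
run;
```

If the ring logic behaves as described, `check_delta` should contain no observations.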

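If you must use SQL, I would suggest following @jspascal and indexing the table, but rearranging the query so that the original data is left joined to an inner-joined sub-select (so that SQL will do an index-involved hash join on the ids). For the same small number of people it should be faster than the original query, but it is still too slow for all 6K.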

proc sql; 
  create index id on have (id);
  create index id_slot on have (id, start_dt);
quit;

proc sql _method;

  reset inobs=50; * limit data so you can see the _method;

  create table want as
  select
    have.*
  , case 
      when ROLLING.HOURS_WORKED_24_HOUR_PRIOR > 16 
      then 1 
      else 0
    end as REVIEW_TIME_CLOCKING_FLAG

  from 
    have
  left join
  (
    select
        EACH_SLOT.id
      , EACH_SLOT.start_dt
      , count(*) as SLOT_COUNT_24_HOUR_PRIOR
      , sum(PRIOR_SLOT.hours) as HOURS_WORKED_24_HOUR_PRIOR
      from 
        have as EACH_SLOT
      join
        have as PRIOR_SLOT
      on
        EACH_SLOT.ID = PRIOR_SLOT.ID
        and EACH_SLOT.start_dt - PRIOR_SLOT.start_dt between 0 and %sysfunc(dhms(0,24,0,0))-0.1
      group by
        EACH_SLOT.id, EACH_SLOT.start_dt
    ) as ROLLING

    on
      have.ID = ROLLING.ID
      and have.start_dt = ROLLING.start_dt

    order by
        id, start_dt
    ;

  %put NOTE: SQLOOPS = &SQLOOPS;
quit;

The inner join is pyramidal and still involves a lot of internal looping.
