Original data:
subject medgrp stdt endt
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
Restructured data:
subject med_pattern stdt_new endt_new
1 A*B 7/1/2014 7/31/2014
1 A*B*C 8/1/2014 8/15/2014
1 A*C 8/16/2014 8/30/2014
1 C 8/31/2014 9/1/2014
2 A 4/15/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 8/1/2014
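By outputting every date from stdt to endt for all records, keeping one date per subject/medgrp, rebuilding the date periods, and creating the variable med_pattern, I was able to convert the original data into the restructured data.
However, this approach takes a very long time to run, especially on large data (> 3 million records).
Any suggestions for making it more efficient would be greatly appreciated!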
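The code for this approach is not shown in the question. As a rough sketch only, assuming the input dataset is named have and using hypothetical intermediate names (days, patterns, runs, want_slow), the steps described above might look like this:

```sas
/* Hypothetical reconstruction of the day-by-day approach; not the asker's actual code. */
data have;
  input subject medgrp $ stdt :mmddyy10. endt :mmddyy10.;
  format stdt endt mmddyy10.;
  datalines;
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 C 8/1/2014 9/1/2014
;

/* Step 1: one row per subject, medgrp, and covered day. */
data days;
  set have;
  do date = stdt to endt;
    output;
  end;
  format date mmddyy10.;
  keep subject medgrp date;
run;

/* Step 2: keep a single row per subject/date/medgrp. */
proc sort data=days nodupkey;
  by subject date medgrp;
run;

/* Step 3: concatenate the medgrps active on each date into med_pattern. */
data patterns;
  length med_pattern $20;
  do until (last.date);
    set days;
    by subject date;
    med_pattern = catx('*', med_pattern, medgrp);
  end;
  keep subject date med_pattern;
run;

/* Step 4: start a new run when the subject changes, a day is skipped,
   or the pattern changes. */
data runs;
  set patterns;
  by subject;
  prev_date = lag(date);
  prev_pattern = lag(med_pattern);
  if first.subject
     or date ne sum(prev_date, 1)
     or med_pattern ne prev_pattern then run_id + 1;
  drop prev_date prev_pattern;
run;

/* Step 5: collapse each run to a single start/end period. */
proc sql;
  create table want_slow as
  select subject,
         med_pattern,
         min(date) as stdt_new format=mmddyy10.,
         max(date) as endt_new format=mmddyy10.
  from runs
  group by subject, run_id, med_pattern
  order by subject, stdt_new;
quit;
```

Every record is expanded to one row per covered day, and the daily table is then sorted and re-read several times, which is a plausible source of the long run time on millions of records.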
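Processing one subject at a time, you can use a date-keyed multidata hash to track medgrp activity on each date within the ranges defined by stdt and endt. Iterating over the hash then lets you compute the medgrps crossings value for each date.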
data have;
  input subject medgrp $ stdt :mmddyy10. endt :mmddyy10.;
  format stdt endt mmddyy10.;
  datalines;
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 A 7/15/2014 7/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
;
data crossings_by_date / view=crossings_by_date;
  if 0 then set have; * prep PDV;
  if _n_ = 1 then do; * declare the hashes and iterators only once;
    declare hash dg(multidata:'yes', ordered:'a'); * 1st hash for subject dates;
    dg.defineKey('date');
    dg.defineData('date', 'medgrp');
    dg.defineDone();
    call missing (date); format date adate cdate mmddyy10.;
    declare hash crossing(ordered:'a'); * 2nd hash for deduping a list of medgrps;
    crossing.defineKey('medgrp');
    crossing.defineData('medgrp');
    crossing.defineDone();
    declare hiter dgi('dg');
    declare hiter xi('crossing');
  end;
  dg.clear(); * start each subject with an empty date hash;
  do _n_ = 1 by 1 until (last.subject); * process subjects one by one;
    set have;
    by subject;
    do date = stdt to endt; * load multidata hash with medgrp over date range;
      dg.add();
    end;
  end;
  * examine each date on which the subject had activity;
  adate = .;
  cdate = -1e9;
  do while (dgi.next() = 0);
    if date eq adate
      then continue; * hiter over multidata returns every item, so skip repeated dates;
      else adate = date; * track activity date;
    * load hash to dedupe tracking of medgrp on date;
    crossing.clear();
    do while (dg.do_over() = 0);
      crossing.replace();
    end;
    * compute crossing representation on date, A*B*... by traversing 2nd hash;
    xi.first(); length cross $20;
    cross = medgrp;
    do while (0 = xi.next());
      cross = catx('*', cross, medgrp);
    end;
    if date - cdate > 1 then cluster + 1; * track cluster based on date continuities;
    cdate = date;
    output; * <------------ view OUTPUT;
  end;
  keep subject date cross cluster;
run;
* 2nd data step processes view (1st data step);
* determine when date continuity ends or medgrp changes;
data want;
  length subject 8 medgrps $20;
  format stdt endt mmddyy10.;
  do _n_ = 1 by 1 until (last.medgrps);
    set crossings_by_date (rename=cross=medgrps);
    by cluster medgrps notsorted;
    if stdt = . then
      stdt = date; * first date of the current cluster/medgrps run;
  end;
  endt = date; * last date of the current cluster/medgrps run;
  keep subject medgrps stdt endt;
run;
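The first step relies on two properties of a multidata hash: several items can share one key, and do_over() walks every item stored under the current key value. A minimal standalone sketch of just that behavior, with made-up dates and groups rather than the data above:

```sas
/* Illustration only: a date-keyed multidata hash holding several medgrps per date. */
data _null_;
  length date 8 medgrp $8;
  declare hash dg(multidata:'yes', ordered:'a');
  dg.defineKey('date');
  dg.defineData('date', 'medgrp');
  dg.defineDone();

  /* two groups active on 1 July, one group on 2 July */
  date = '01JUL2014'd; medgrp = 'A'; dg.add();
  date = '01JUL2014'd; medgrp = 'B'; dg.add();
  date = '02JUL2014'd; medgrp = 'A'; dg.add();

  /* do_over() uses the current value of the key variable (date)
     and returns each item stored under it in turn */
  date = '01JUL2014'd;
  do while (dg.do_over() = 0);
    put date= mmddyy10. medgrp=;
  end;
run;
```

The log should show one line per group stored under 1 July, which is the per-date list that the view step deduplicates and joins with '*'.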