have
是一个具有4个变量的sas数据集:id
和变量,用于存储受访者与其所在团队的3个不同成员共享的所有活动的信息。有4种不同的活动类型,由每个玩家(p1至p3)的:_activities
变量中填充的数字标识。以下是前5个Obs:
id p1_activities p2_activities p3_activities
A 1,2,3,4 1,3
B 1,3 1,2,3 1,2,3
C 1,2,3 1,2,3
D 1,2,3
E 1,2,3 1
考虑受访者A:他们与团队中的参与者1共享全部4个活动,并与团队2中的参与者2共享活动1和3。我需要为每个玩家位置和每个活动创建标志。例如,对于所有出现在p1_act2_flag
字符变量中的值为1
的受访者,新的数字变量2
应等于p1_activities
。这是我需要为显示的数据创建的12个总数中的前6个变量:
p1_act1_flag p1_act2_flag p1_act3_flag p1_act4_flag p2_act1_flag p2_act2_flag …
1 1 1 1 1 0 …
1 0 1 0 1 1 …
. . . . 1 1 …
. . . . 1 1 …
1 1 1 0 . . …
我现在通过在length语句中初始化所有变量名称,然后编写ton if-then语句来完成此操作。我想使用更少的代码行,但是我的数组逻辑不正确。这是我尝试为玩家1创建标志的方法:
data want;
length p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg
p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg
p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg
p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg 8.0;
set have;
array plracts {*} p1_activities p2_activities p3_activities;
array p1actflg {*} p1_act1_flg p1_act2_flg p1_act3_flg p1_act4_flg;
array p2actflg {*} p2_act1_flg p2_act2_flg p2_act3_flg p2_act4_flg;
array p3actflg {*} p3_act1_flg p3_act2_flg p3_act3_flg p3_act4_flg;
array p4actflg {*} p4_act1_flg p4_act2_flg p4_act3_flg p4_act4_flg;
do i=1 to 3;
do j=1 to 4;
if find(plracts{i}, cats(put(j, $12.))) then p1actflg{j}=1;
else if missing(plracts{i}) then p1actflg{j}=.;
else p1actflg{j}=0;
end;
end;
*do this again for the other p#actflg arrays;
run;
SAS告诉我,我的“数组下标超出范围”-但这仍然感觉不是很好的解决方案。
如何用更少的代码行来做到这一点?
不确定只有3个标记时为什么要处理4个标记活动。
一些想法:
activities_p1-activities_p3
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
DIM
保持在数组范围内。不少,但也许更健壮?
此代码检查活动列表中的每个项目,而不是查找特定项目的存在(1..4):
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activites);
if missing(activities[i]) then continue; %* skip;
do j = 1 by 1;
item = scan ( activities[i], j, ',' );
if missing(item) then leave; %* no more items in csv list;
item_num = input (item,?1.);
if missing(item_num) then continue; %* skip, csv item is not a number;
if item_num > hbound(flags,2) or item_num < lbound(flags,2) then do;
put 'WARNING:' item_num 'is invalid for flagging';
continue; %* skip, csv item is missing, 0, negative or exceeds 4;
end;
flags (i, item_num) = 1;
end;
* backfill zeroes where flag not assigned;
do j = 1 to hbound(flags,2);
flags (i, item_num) = sum (0, flags (i, item_num)); %* sum() handles missing values;
end;
end;
这里是相同的处理,但仅搜索要标记的特定项目:
data have; length id activities_p1-activities_p3 $20;input
id activities_p1-activities_p3 ; datalines;
A 1,2,3,4 1,3 .
B 1,3 1,2,3 1,2,3
C . 1,2,3 1,2,3
D . 1,2,3 .
E 1,2,3 . 1
;
data want;
set have;
array activities
activities_p1-activities_p3
;
array flags(3,4)
flag_p1_1-flag_p1_4
flag_p2_1-flag_p2_4
flag_p3_1-flag_p3_4
;
do i = 1 to dim(activities);
if not missing(activities[i]) then
do j = 1 to hbound(flags,2);
flags (i,j) = sum (flags(i,j), findw(trim(activities[i]),cats(j),',') > 0) > 0;
end;
end;
run;
发生了什么事?
hbound
返回4作为第二维的上限findw(trim(activities[i]),cats(j),',')
在csv字符串中找到j的位置trim
需要删除不属于findw
字定界符列表的尾部空格cats
将j数字转换为字符表示形式findw
返回j在csv字符串中的位置。compress
隔开空格和其他垃圾。> 0
评估位置到0
j不存在且1
存在> 0
是另一个逻辑求值,可确保j当前标志保持为0
或1
。否则,标记将是频率计数(想象活动数据1,1,2,3
)flags(i,j)
覆盖了3 x 4个可用于标记的插槽。考虑转换为分层视图并在其中执行逻辑。真正的问题在于,每个列表中可能缺少职位。因此,简单的do循环将很困难。更快的方法是多步骤:
它不像可以完成的单个数据步骤那样优雅,但是使用起来有点容易。
data have;
infile datalines dlm='|';
input id$ p1_activities$ p2_activities$ p3_activities$;
datalines;
A|1,2,3,4|1,3|
B|1,3|1,2,3|1,2,3|
C| |1,2,3|1,2,3|
D| |1,2,3|
E|1,2,3| |1
;
run;
/* Make a template of all possible players and positions */
data template;
set have;
array players p1_activities--p3_activities;
length varname $15.;
do player = 1 to dim(players);
do activity = 1 to 4;
/* Generate a variable name for later */
varname = cats('p', player, '_act', activity, '_flg');
output;
end;
end;
keep ID player activity varname;
run;
/* Create a list of actual players and their positions */
data actual;
set have;
array players p1_activities--p3_activities;
do player = 1 to dim(players);
do i = 1 to countw(players[player], ',');
activity = input(scan(players[player], i, ','), 8.);
/* Do not output missing positions */
if(NOT missing(activity)) then output;
end;
end;
keep ID player activity;
run;
/* Merge the template with actual values and create a flag when an
an id, player, and activity matches with the template
*/
data want_long;
merge template(in=all)
actual(in=act);
by id player activity;
flag_activity = (all=act);
run;
/* Transpose it back to wide */
proc transpose data=want_long
out=want_wide;
id varname;
by id;
var flag_activity;
run;
按照Stu的示例,DS2
DATA步骤可以使用哈希查找执行其“合并”。哈希查找取决于创建一个将CSV项目列表映射到标志的数据集。
* Create data for hash;
data share_flags(where=(not missing(key)));
length key $7 f1-f4 8;
array k[4] $1 _temporary_;
do f1 = 0 to 1; k[1] = ifc(f1,'1','');
do f2 = 0 to 1; k[2] = ifc(f2,'2','');
do f3 = 0 to 1; k[3] = ifc(f3,'3','');
do f4 = 0 to 1; k[4] = ifc(f4,'4','');
key = catx(',', of k[*]);
output;
end;end;end;end;
run;
proc ds2;
data want2 / overwrite=yes;
declare char(20) id;
vararray char(7) pact[*] activities_p1-activities_p3;
vararray double fp1[*] flag_p1_1-flag_p1_4;
vararray double fp2[*] flag_p2_1-flag_p2_4;
vararray double fp3[*] flag_p3_1-flag_p3_4;
declare char(1) sentinel;
keep id--sentinel;
drop sentinel;
declare char(7) key;
vararray double flags[*] f1-f4;
declare package hash shares([key],[f1-f4],4,'share_flags'); %* load lookup data;
method run();
declare int rc;
set have;
rc = shares.find([activities_p1],[flag_p1:]); %* find() will fill-in the flag variables;
rc = shares.find([activities_p2],[flag_p2:]);
rc = shares.find([activities_p3],[flag_p3:]);
end;
enddata;
run;
quit;
%let syslast = want2;