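I have a large dataset containing hundreds of variables collected at multiple time points. The variables are already defined per time point, but each observation is a different time point. It is as if the dataset was designed in wide format but collected in long format, like this: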
data have;
input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 time_1 1 . . note1 . .
1 time_2 . 2 . . note2 .
1 time_3 . . 3 . . note3
2 time_1 1 . . note1 . .
2 time_2 . 2 . . note2 .
2 time_3 . . 3 . . note3
;
run;
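The variable timepoint is redundant; the variable names already describe the time point. I need to collapse the dataset to a single observation per id (that is, to plain wide format), like this: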
data want;
input id var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 1 2 3 note1 note2 note3
2 1 2 3 note1 note2 note3
;
run;
Note: unfortunately, the variable names do not always end in _t1, _t2, etc. as they do in my example (e.g. note_t1_2), so I can't easily reference the suffix.
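My first thought was to split the dataset into separate per-visit datasets (data timepoint_1_have; set have; if timepoint = "time_1"; and so on) and then merge them by id. Merging directly loses data (I'm sure that's obvious, but I had thought the missing values would be the ones overwritten). So I figured that, before merging by id, I would drop every variable that contains only missing values. That turned out to be very difficult, and I couldn't find a way to do it for both character and numeric variables without pages of macro code...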
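So I tried a different strategy using retain. Since the first.id row has missing values for the variables belonging to time points > 1, I thought that retaining the non-missing values across time points and then keeping only the last.id row might work: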
data timepoint_1_want;
set timepoint_1_have;
array Nums[*] _numeric_;
array Chars[*] _character_;
by id;
do i = 1 to dim(Nums);
if not missing(Nums[i]) then do;
retain Nums[i];
end;
do i = 1 to dim(Chars);
if not missing(Chars[i]) then do;
retain Chars[i];
end;
drop i;
if last.id then output;
run;
But retain can't be used like this inside a do loop, so that doesn't work either:
12155 data timepoint_1_want;
12156 set timepoint_1_have;
12157 array Nums[*] _numeric_;
12158 array Chars[*] _character_;
12159 by id;
12160 do i = 1 to dim(Nums);
12161 IF not missing(Nums[i]) THEN do;
12162 retain Nums[i];
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
_CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.
12163 end;
12164 do i = 1 to dim(Chars);
12165 IF not missing(Chars[i]) THEN do;
12166 retain Chars[i];
-
22
76
ERROR 22-322: Syntax error, expecting one of the following: a name, a quoted string,
a numeric constant, a datetime constant, a missing value, (, -, :, ;, _ALL_,
_CHARACTER_, _CHAR_, _NUMERIC_.
ERROR 76-322: Syntax error, statement will be ignored.
12167 end;
12168 drop i;
12169 IF last.id THEN output;
12170 run;
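You can use the UPDATE trick. With an empty copy of the dataset as the master, every row is applied as a transaction, and a missing value in a transaction does not overwrite a value already collected for that id.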
data have;
input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
1 time_1 1 . . note1 . .
1 time_2 . 2 . . note2 .
1 time_3 . . 3 . . note3
2 time_1 1 . . note1 . .
2 time_2 . 2 . . note2 .
2 time_3 . . 3 . . note3
;
run;
data want;
update have(keep=id obs=0) have(drop=timepoint);
by id;
run;
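One thing the example takes for granted is that the data are already ordered by id, which UPDATE with a BY statement requires. The following is only a sketch for the case where they are not; the names have_unsorted, have_sorted and want_sorted are hypothetical and do not come from the question.

```sas
/* Hypothetical unsorted copy of the example data (assumption for illustration). */
data have_unsorted;
  input id timepoint $ var_t1 var_t2 var_t3 note_t1 $ note_t2 $ note_t3 $;
datalines;
2 time_3 . . 3 . . note3
1 time_1 1 . . note1 . .
2 time_1 1 . . note1 . .
1 time_3 . . 3 . . note3
1 time_2 . 2 . . note2 .
2 time_2 . 2 . . note2 .
;
run;

/* UPDATE needs both datasets ordered by the BY variable. */
proc sort data=have_unsorted out=have_sorted;
  by id;
run;

data want_sorted;
  /* Master: zero observations, key only. Transaction: all rows, without timepoint. */
  update have_sorted(keep=id obs=0) have_sorted(drop=timepoint);
  by id;
run;

/* Inspect the collapsed result: one row per id. */
proc print data=want_sorted noobs;
run;
```

If a variable happens to be non-missing on more than one row for the same id, the value from the last such row within that id is the one that is kept, so it may be worth checking for such overlaps before relying on the result.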