I have a dataset that looks like this:
Account Number  6m      7m      8m      9m      10m     11m
1               Better  X < 10  X < 10  Better  X < 30  X < 30
2               X < 10  X < 20  X < 30  X < 20  X < 20  X < 20
3               Better  Better  Better  Better  X < 10  X < 20
4               X < 10  Better  Same    Same    Same    Same
5               Same    Better  Same    Same    Same    Same
6               Same    Same    Same    Better  Better  Better
7               Same    X < 10  X < 10  X < 10  X < 10  Better
8               Better  Better  Better  Better  Better  Better
9               X < 10  X < 10  X < 10  X < 20  X < 30  Better
10              X < 20  X < 30  X < 30  X < 30  X < 30  X < 30
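Each cell tells me what happened to each account number after 6-11 months. I want to convert this into a dataset from which I can create graphs and so on, so I would like to transpose it to look like this: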
Result  6m  7m  8m  9m  10m  11m
X < 10  3   3   3   1   2    0
X < 20  1   1   0   2   1    2
X < 30  0   1   1   1   2    1
Same    3   1   3   2   2    2
Better  1   2   1   2   2    4
It would be even better if there were a way to convert the counts in each column to percentages.
/* Name literals such as "6m"n need VALIDVARNAME=ANY if the session default is V7 */
options validvarname=any;

data have;
    infile datalines dlm='|';
    input "Account Number"n "6m"n$ "7m"n$ "8m"n$ "9m"n$ "10m"n$ "11m"n$;
    datalines;
1|Better|X < 10|X < 10|Better|X < 30|X < 30
2|X < 10|X < 20|X < 30|X < 20|X < 20|X < 20
3|Better|Better|Better|Better|X < 10|X < 20
4|X < 10|Better|Same|Same|Same|Same
5|Same|Better|Same|Same|Same|Same
6|Same|Same|Same|Better|Better|Better
7|Same|X < 10|X < 10|X < 10|X < 10|Better
8|Better|Better|Better|Better|Better|Better
9|X < 10|X < 10|X < 10|X < 20|X < 30|Better
10|X < 20|X < 30|X < 30|X < 30|X < 30|X < 30
;
run;
First, stack the data so that we can do some counting:
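data stack;
    set have;

    /* All character variables, i.e. the six month columns */
    array charvars[*] _CHARACTER_;

    /* Write one row per account and month */
    do i = 1 to dim(charvars);
        result = charvars[i];
        var    = vname(charvars[i]);
        output;
    end;

    keep result var;
run;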
This gives you:
result  var
Better  6m
X < 10  7m
X < 10  8m
Better  9m
X < 30  10m
X < 30  11m
...     ...
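I'm sure you could do some really cool things with proc report using this data, but that's not an area I know particularly well. Instead, we'll create the dataset with a few more steps.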
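For a quick look at the numbers before building any dataset, a cross-tabulation of the stacked data is one option. The following is only a sketch: it prints a report rather than creating the table we are after, and it assumes the stack dataset from the previous step.

```sas
/* Sketch only: cross-tabulate the stacked data from the step above. */
proc freq data=stack;
    /* Rows are result values, columns are months (var).             */
    /* NOROW and NOPERCENT suppress row and overall percentages, so  */
    /* each cell shows the count and the percentage within its month. */
    tables result*var / norow nopercent;
run;
```

The column percentages in this report should match the pct values computed in the steps that follow, which makes it a handy cross-check.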
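We can collapse it and count the number of values in each result, var combination, then calculate the percentage that each one represents within its var: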
proc sql;
    create table count as
        /* Outer query: divide each count by the total for its month */
        select result, var, total, total / sum(total) as pct format=percent8.1
            from (select result, var, count(*) as total
                      from stack
                      group by result, var
                 )
            group by var
            order by result, var
    ;
quit;
This gives us:
result  var  total  pct
Better  10m  2      20.0%
Better  11m  4      40.0%
Better  6m   3      30.0%
Better  7m   4      40.0%
Better  8m   2      20.0%
Better  9m   4      40.0%
...     ...  ...    ...
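Since the original goal includes creating graphs, note that this long-format count table is already in a shape that plotting procedures accept. As a rough sketch, assuming the count dataset above and that PROC SGPLOT is available in your SAS installation, a stacked bar chart per month could look like this:

```sas
/* Sketch only: stacked bars of the percentage of each result per month. */
proc sgplot data=count;
    /* One bar per month (var), split into segments by result */
    vbar var / response=pct group=result groupdisplay=stack;
run;
```

The bars appear in the procedure's default category order, which may not be chronological (for example, 10m before 6m), so the axis order may need adjusting.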
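Now we have everything we need to convert it into the format we want. The id statement in proc transpose lets us use the values of var as the names of the transposed columns. We'll do this with result as the by variable.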
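proc transpose data=count out=count_tpose(drop=_NAME_);
    by result;   /* one output row per result */
    id var;      /* month values become the column names */
    var pct;     /* the values to spread across the columns */
run;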
This almost gets us what we want:
result  10m    11m    6m     7m     8m     9m
Better  20.0%  40.0%  30.0%  40.0%  20.0%  40.0%
Same    20.0%  20.0%  30.0%  10.0%  30.0%  20.0%
X < 10  20.0%  .      30.0%  30.0%  30.0%  10.0%
X < 20  10.0%  20.0%  10.0%  10.0%  .      20.0%
X < 30  30.0%  20.0%  .      10.0%  20.0%  10.0%
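If you want the raw counts from the question rather than percentages, the same transpose should work on the total column instead. This is a sketch under that assumption; the output name count_n_tpose is made up for illustration.

```sas
/* Sketch only: same transpose as above, but spreading the counts.   */
/* count_n_tpose is a hypothetical output dataset name.              */
proc transpose data=count out=count_n_tpose(drop=_NAME_);
    by result;   /* count is already sorted by result */
    id var;      /* month values become the column names */
    var total;   /* counts instead of percentages */
run;
```

Combinations that never occur come out as missing here too, so the cleanup described next applies to this table as well.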
Now we just need to clean it up by replacing the missing values with 0 and sorting result into the desired order:
/* Replace missing with 0 */
proc stdize data=count_tpose
            out=want
            missing=0
            reponly;
run;

/* Fix sort order */
data want_sorted;
    /* Set variable order */
    length Result $10
           "6m"n "7m"n "8m"n "9m"n "10m"n "11m"n 8
    ;

    set want;

    /* Numeric key giving the desired row order */
    select(result);
        when('X < 10') order = 1;
        when('X < 20') order = 2;
        when('X < 30') order = 3;
        when('Same')   order = 4;
        otherwise      order = 5;   /* Better */
    end;
run;

proc sort data=want_sorted out=want_sorted_final(drop=order);
    by order;
run;
This gives us the final result we want:
Result  6m     7m     8m     9m     10m    11m
X < 10  30.0%  30.0%  30.0%  10.0%  20.0%  0.0%
X < 20  10.0%  10.0%  0.0%   20.0%  10.0%  20.0%
X < 30  0.0%   10.0%  20.0%  10.0%  30.0%  20.0%
Same    30.0%  10.0%  30.0%  20.0%  20.0%  20.0%
Better  30.0%  40.0%  20.0%  40.0%  20.0%  40.0%