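I have a dataset with a text field that holds long free-form text values. I need to identify and extract every 16-digit account number from that text field and create a column from the extracted account numbers.
Data I have: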
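data have;
    infile datalines truncover;
    /* Read the id with list input, then the rest of the line as text */
    input acct_num txt_field $100.;
    datalines;
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024
7789976 Data declined and assigned to 7890512323454545
;
run;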
Data I need:
acct_num txt_field acct1 acct2
3435436 Payment issue reported 3456123789065322 to 0909876789432123 dated 9 mar 2024 3456123789065322 0909876789432123
7789976 Data declined and assigned to 7890512323454545 7890512323454545
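So far I have used the PRXPARSE and PRXMATCH functions, but they only seem to work when you know exactly what you are looking for in the text field. Here I am simply looking for any 16-digit numeric value.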
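You're on the right track with regular expressions. Use call prxnext() to iterate over every 16-digit account number in the text. The regular expression \b\d{16}\b will find them: \d{16} matches exactly 16 digits, and the \b word boundaries keep it from matching part of a longer run of digits.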
data want;
    set have;
    retain exprid;

    /* Compile the regular expression once and reuse its ID on every row */
    if (_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');

    /* Scan all of the text. start must be a variable, not a constant:
       call prxnext() advances it past each match it finds */
    start = 1;
    stop = length(txt_field);

    /* Find the first value */
    call prxnext(exprid, start, stop, txt_field, pos, len);

    /* Keep scanning until there are no more account numbers found */
    do while (pos > 0);
        acct_num_16 = substr(txt_field, pos, len);
        output;
        call prxnext(exprid, start, stop, txt_field, pos, len);
    end;

    keep acct_num txt_field acct_num_16;
run;
acct_num txt_field acct_num_16
3435436 Payment issue reported ... 3456123789065322
3435436 Payment issue reported ... 0909876789432123
7789976 Data declined and assigned ... 7890512323454545
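The result above has one row per account number, whereas the question asks for one row per record with columns acct1, acct2, and so on. A minimal sketch of that wide layout follows, using the same call prxnext() loop but writing into an array instead of calling output. The dataset name want_wide and the cap of 5 account numbers per row are assumptions for illustration; adjust the array size to the most matches you expect.

```sas
data want_wide;
    set have;
    retain exprid;

    /* Assumption: at most 5 account numbers per row; creates acct1-acct5 */
    array acct[5] $16;

    if (_N_ = 1) then exprid = prxparse('/\b\d{16}\b/');

    start = 1;
    stop = length(txt_field);
    call prxnext(exprid, start, stop, txt_field, pos, len);

    /* Fill the next array slot for each match; stop when the array is
       full or no further match is found */
    do i = 1 to dim(acct) while (pos > 0);
        acct[i] = substr(txt_field, pos, len);
        call prxnext(exprid, start, stop, txt_field, pos, len);
    end;

    keep acct_num txt_field acct1-acct5;
run;
```

Unlike the long version, this keeps rows that contain no account number at all (their acct columns are simply blank), and any matches beyond the array size are silently ignored.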