Insert delimiters in a file using a Linux script [closed]

Question  Votes: 0  Answers: 5

I have a non-delimited text file containing about 1 million lines.

Sample lines

1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659      [email protected]                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

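On each line starting with the digit "2", "1", or "3" (the row type), I have to insert delimiters based on character counts, i.e. at the end of positions 0-1, 1-20, 21-25, and so on.

How can I do this with a Linux script?

Desired output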

1|YBL LOYALTY EXT |10001|01172019|001
2|00010010100001151|2753|184907301010614199100919699034659      |[email protected]                                     |VIDYA SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3|000000027

I tried this command

perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_"}   if(/^1/) { @x=(1,16,5,8); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_" }  if(/^3/) { @x=(1); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ } 
       print "$_" }'  filename

Input lines

1YBL LOYALTY EXT 1000112102018001
2000100101000002631653184911501010111199100919323739251      [email protected]                                   VIJAY PANDEY                            PART OF GROUND FLOOR & BASEMENT         SHOPPER STOP SV ROAD ANDHERI WEST       LANDMARK-ERSTWHILE CRASSWORD BOOK STORE MUMBAI                        400058
2000100101000019920453184964321010513199000919878857482      [email protected]                                  MOHAMAD MAQSHUD MASTER                  H COLLECTION NEW SHIVPURI               GALI NO 1                               NEAR MAKHAN SINGH CHOWK                 LUDHIANA                      141008
2000100101000023500853184923441010913197300919375580888      [email protected]                                        JAYANTIBHAI TADA                        44 KHODIYAR NAGAR B S ABHISHEK          SUDAMA CHOWK                            KHODIYARNAGAR MOTA VARACHHA             SURAT                         395006
3000000066

Expected output

1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |[email protected]                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |[email protected]                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |[email protected]                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|000000066

But this is what I get

1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |[email protected]                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |[email protected]                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
1|41008|
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |[email protected]                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|95006
3|000000066
linux bash awk sed
5 Answers
1
vote

You can also try Perl:

perl -lpe ' if(/^2/) { @x=(1,17,4); 
           for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' input_file

With the given input:

$ cat rahman.txt
1YBL LOYALTY EXT 1000101172019001
2000100101000011512753184907301010614199100919699034659      [email protected]                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

$ perl -lpe ' if(/^2/) { @x=(1,17,4); 
             for $i (@x) { s/(.{$i})//; printf("%s|",$1) } }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      [email protected]                                     VIDYA SAGAR                             CROSS                                   BANDRA                                  WM                                      DELHI                         456471
3000000027

$

To split off more fields, just add entries to @x=(1,17,4), for example @x=(1,17,4,10,20).

EDIT1

To also add delimiters to the fields that can be split on whitespace, use the following:

$ perl -lpe ' if(/^2/) { @x=(1,17,4); 
             for $i (@x) { s/(.{$i})//; printf("%s|",$1) } s/\S+\s+\K/|/g }' rahman.txt
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      |[email protected]                                     |VIDYA |SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3000000027

$

Explanation of the code

perl -lpe   # -p prints $_ automatically at the end of each iteration of the one-liner,
        # so a line that doesn't start with 2 is still printed after the if statement.
        # -l strips the trailing newline on input and adds it back on print.

' if(/^2/)  # if - select line that starts with 2. $_ will have the current line
{ 
@x=(1,17,4); # x is an array to hold the widths of fields. - 1, 17, 4 
for $i (@x)  # open for loop to loop through the array x
{ 
s/(.{$i})//;  # no variable is specified, so the substitution acts on the $_ i.e current line
          # first instance is s/(.{1})// => match one character and store it in $1 capturing variable
          # replace the captured part with nothing and update $_
          # e.g if the line is "200010010100001151" .. loop one will capture "2" and $_ becomes "00010010100001151"
          # loop 2 => s/(.{17})// matches 17 character and $1 stores "00010010100001151"
printf("%s|",$1)  # print $1 along with delimiter pipe 
}  # end of for loop
}  # end of if
# here is default print statement in perl that will print the $_ after all modification
' input_file

EDIT2

With your input I get the following result. It works fine. What problem are you seeing?

$ perl -ne ' if(/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_"}   if(/^1/) { @x=(1,16,5,8); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_" }  if(/^3/) { @x=(1); $i=0;
>        while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
>        print "$_" }'  rahman.txt
1|YBL LOYALTY EXT |10001|01172019|001
2|0001001010000115127|531849|0730|101|06141991|00919699034659      |[email protected]                                     VID|YA SAGAR                             CRO|SS                                   BAN|DRA                                  WM |                                     DEL|HI                         456|471
3|000000027

$

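EDIT3:

Found the problem. $_ is modified inside the block, so at the end of the /^2/ if block $_ holds the leftover value "141008". That leftover then satisfies the next if(/^1/) condition, so that block runs as well. To avoid this, copy $_ into a $line variable at the start, then test $line against /^2/, /^1/, and /^3/ in the separate if blocks.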

$ perl -lne '$line=$_; if($line=~/^2/) { @x=(1,19,6,4,3,8,20,60,40,40,40,40,30); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
        print "$_" }
       if($line=~/^1/) { @x=(1,16,5,8); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
        print "$_" }
       if($line=~/^3/) { @x=(1); $i=0;
       while($i<=$#x) { $s=$x[$i]; $_=~s/(.{$s})/printf("%s|",$1);""/e;$i++ }
       print "$_" }'  rahman2.txt
1|YBL LOYALTY EXT |10001|12102018|001
2|0001001010000026316|531849|1150|101|01111991|00919323739251      |[email protected]                                   |VIJAY PANDEY                            |PART OF GROUND FLOOR & BASEMENT         |SHOPPER STOP SV ROAD ANDHERI WEST       |LANDMARK-ERSTWHILE CRASSWORD BOOK STORE |MUMBAI                        |400058
2|0001001010000199204|531849|6432|101|05131990|00919878857482      |[email protected]                                  |MOHAMAD MAQSHUD MASTER                  |H COLLECTION NEW SHIVPURI               |GALI NO 1                               |NEAR MAKHAN SINGH CHOWK                 |LUDHIANA                      |141008
2|0001001010000235008|531849|2344|101|09131973|00919375580888      |[email protected]                                        |JAYANTIBHAI TADA                        |44 KHODIYAR NAGAR B S ABHISHEK          |SUDAMA CHOWK                            |KHODIYARNAGAR MOTA VARACHHA             |SURAT                         |395006
3|000000066

$
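
The same fix (read the row type once from the untouched line, then cut fields by position) can also be sketched in POSIX awk wrapped in a shell function. This is only an illustration: the function name add_delims is hypothetical, and the width lists are copied from the question's command on the assumption that they describe the real record layout.

```shell
# Hypothetical helper: insert "|" after each fixed-width field,
# choosing the width list from the first character of the line.
add_delims() {
  awk '
    BEGIN {
      # Widths taken from the question; the final field is whatever remains.
      w["1"] = "1 16 5 8"
      w["2"] = "1 19 6 4 3 8 20 60 40 40 40 40 30"
      w["3"] = "1"
    }
    {
      rt = substr($0, 1, 1)               # row type, read before any change
      if (!(rt in w)) { print; next }     # unknown row type: pass through
      n = split(w[rt], width, " ")
      out = ""; pos = 1
      for (i = 1; i <= n; i++) {
        out = out substr($0, pos, width[i]) "|"   # one field plus delimiter
        pos += width[i]
      }
      print out substr($0, pos)           # append the remainder of the line
    }
  ' "$@"
}
```

Usage: add_delims rahman2.txt. Because the input line is never modified, leftover text from a type 2 row cannot be mistaken for a type 1 or type 3 row.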

4
votes

With GNU awk, using FIELDWIDTHS:

$ awk -v FIELDWIDTHS='1 17 4 *' -v OFS='|' '/^2/{$1=$1; gsub(/\s+/,"&"OFS)} 1' file
1YBL LOYALTY EXT 1000101172019001
2|00010010100001151|2753|184907301010614199100919699034659      |[email protected]                                     |VIDYA |SAGAR                             |CROSS                                   |BANDRA                                  |WM                                      |DELHI                         |456471
3000000027

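The above use of FIELDWIDTHS says that the input should be split into 4 fields: 1 character wide, 17 characters wide, 4 characters wide, and then the rest of the line.

When you assign a value to a field, awk rebuilds the record, replacing the input field separators with the value of OFS, so $1=$1 causes a | to be inserted between each of the fields described by FIELDWIDTHS.

Once that is done, the remaining whitespace-separated text still needs field separators, so gsub() appends OFS after each run of whitespace.

Older versions of gawk don't support * meaning "the rest of the line". If that is your situation, just replace * with a large value such as 99999.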


0
votes

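There are delimiters in your file, you just can't see them: they are spaces or tabs. So you only need to replace those, using a sed 's/xxx/|/g' command (where xxx stands for the space or TAB character). If you are unsure whether the characters are spaces or tabs, open the file in a hex editor (a space is ASCII code 32 (hex: 20), a TAB is 9 (hex: 09)).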
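
If no hex editor is at hand, the same check can be sketched from the shell with od, a POSIX utility. The helper name show_whitespace and the sample text are made up for illustration; in practice you would pass your own file, or pipe in a few lines of it with head.

```shell
# Hypothetical helper: dump the input one character per column, so that
# tabs appear as \t and spaces appear as blank columns.
show_whitespace() {
  od -An -c "$@"
}

# Made-up sample containing one tab and one space.
printf 'a\tb c\n' | show_whitespace
```

A \t in the dump means the separator is a TAB; an empty column means it is a space.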


0
votes

You can try this with GNU sed:

sed -E '/^2/{s//&|/;s/(.{19})(....)(\S+\s+)/\1|\2|\3|/}' infile

0
votes

If you don't have FIELDWIDTHS, try the following.

awk -v var="1,17,4" -v OFS="|" '
BEGIN{
  num=split(var,width,",")               # var holds the field widths, in order
}
{
  val=""; pos=1
  for(i=1;i<=num;i++){
    val=val substr($0,pos,width[i]) OFS  # cut one field and append the delimiter
    pos+=width[i]                        # advance past the field just cut
  }
  rest=substr($0,pos)                    # text after the fixed-width fields
  gsub(/[[:space:]]+/,"&"OFS,rest)       # delimiter after each run of whitespace
  print val rest
}
'   Input_file