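I've got an unusual one here. We have a pipe-delimited file with a header, but the 9th field takes user input, and occasionally users enter the pipe symbol there. This throws the file format completely out of the window, because the number of pipes no longer matches the header. See the example below: it is the 5th entry under the EVNT_MSSG header: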
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here | pipes | not needed | Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
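Apparently, changing this at the source would incur a cost the company is unwilling to pay, so I have been tasked with coming up with a solution that removes the pipe symbols from the 9th field only, while leaving all the other fields intact.
Unfortunately, I've hit a brick wall.
I am using | as the field separator in awk to pull out the 9th field, i.e.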
awk 'BEGIN { FS = "[|]+" } ; { print $9 }'
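But the extra pipes throw it off, because the first unwanted pipe is treated as the next legitimate delimiter. I think I may have to come at this from a different angle, but I haven't the foggiest where to start. Any help with this would be greatly appreciated.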
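Before picking a fix, it may help to confirm which rows are actually affected. The following is a minimal sketch, not part of the original question: it takes the expected field count from the header line and reports every row whose count differs. The function name report_bad_rows and the three-column sample are made up for illustration.

```shell
# Hypothetical helper: list rows whose pipe-delimited field count
# differs from that of the header (line 1).
report_bad_rows() {
  awk -F'|' '
    NR == 1 { want = NF; next }    # the header sets the expected count
    NF != want { printf "%d: %d fields, expected %d\n", NR, NF, want }
  '
}

# Made-up sample: the third line carries one stray pipe.
printf '%s\n' 'a|b|c' '1|2|3' '1|x|y|3' | report_bad_rows
```

On the real file this would be run as `report_bad_rows < file`; no output means every row already matches the header.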
This solution should work with any awk:
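awk -v c=9 '
BEGIN { FS = OFS = "|" }
NR == 1 {
    totCol = NF            # the header defines the expected field count
    print
    next
}
diff = NF - totCol {       # runs only when the field count differs from the header
    s = ""
    for (i = c; i <= NF; ++i) {
        if (i <= c + diff)
            s = s $i       # glue the pieces of field c back together
        $i = $(i + diff)   # shift the following fields left
    }
    NF = totCol            # drop the surplus trailing fields
    $c = s
} 1' file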
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here pipes not needed Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
I would use GNU AWK for this task in the following way. Let the content of file.txt be
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here | pipes | not needed | Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
then
awk 'BEGIN{FS=OFS=""}NR==1{split($0,arr)}{for(i=1;i<=NF;i+=1){if($i=="|"&&arr[i]!="|"){$i=" "}};print}' file.txt
gives the output
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here pipes not needed Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
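Explanation: I inform GNU AWK that both the field separator and the output field separator are the empty string, which makes every field one character wide. When processing the first line (NR==1) I use the split function to fill the array arr with the characters of the header. Then, for each line, I iterate over all fields, and if I find a field that contains | while the corresponding entry in arr is not |, I change that field to a space character. After processing the fields I print the line. Note that this relies on fixed-width columns, so that every legitimate pipe sits at the same character position as in the header.
(tested in GNU Awk 5.0.1)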
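The same position-by-position comparison can be sketched without an empty FS, whose behavior POSIX leaves unspecified. This is an illustrative variant rather than the answer's own code: it walks each line with substr and blanks any pipe that does not line up with a pipe in the header. The function name fix_by_position and the tiny fixed-width sample are assumptions made for the example.

```shell
# Hypothetical helper: blank out every pipe that is not at a
# character position where the header (line 1) also has a pipe.
fix_by_position() {
  awk '
    NR == 1 { hdr = $0; print; next }      # remember the header layout
    {
      out = ""
      n = length($0)
      for (i = 1; i <= n; i++) {
        ch = substr($0, i, 1)
        if (ch == "|" && substr(hdr, i, 1) != "|")
          ch = " "                         # stray pipe: replace with a space
        out = out ch
      }
      print out
    }
  '
}

# Made-up fixed-width sample: header pipes sit at columns 3 and 11,
# and the data row has one extra pipe at column 6.
printf '%s\n' 'ID|MESSAGE|SRC' '7 |a | b c|op' | fix_by_position
```

Like the one-liner above, this only makes sense when the columns are fixed-width.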
Perhaps this approach would suit?
awk 'BEGIN { FS = "|" }   # set field separator to pipe
NF == 15 { print }        # if number of fields is correct, print the line
NF > 15 {                 # if NF is greater than expected i.e. extra pipes in $9
    for (i = 1; i <= 8; i++) { printf "%s|", $i }          # print the first 8 fields
    for (j = 9; j <= (NF - 6); j++) { printf "%s", $j }    # print the next N fields without a pipe delimiter
    for (k = (NF - 5); k <= NF; k++) { printf "|%s", $k }  # print the last 6 fields
    print ""              # print a newline
}' file
Applied to your sample data:
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here pipes not needed Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
Assuming you have a fixed number of fields, here's one possible approach:
perl -pe 's/^([^|]*\|){8}\K.*?(?=(\|[^|]*){6}$)/$&=~s,\|,\\|,gr/e' ip.txt
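This replaces every | with \| in the problematic field.
- ^([^|]*\|){8}\K consumes the first 8 fields; \K keeps them out of the matched portion.
- .*? lazily matches zero or more characters.
- (?=(\|[^|]*){6}$) is a positive lookahead that ensures the last 6 fields are not modified.
- The e flag allows Perl code in the replacement section. Here the matched portion, $&, is modified by a nested substitution, and the r flag makes that substitution return the modified string.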
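For readers without Perl at hand, the escaping idea can also be sketched in awk. This is not a translation of the regex above; it is a field-count approach under the same assumption that only one known column can contain stray pipes. The function name escape_pipes, its column argument (defaulting to 9), and the three-column sample are hypothetical.

```shell
# Hypothetical helper: rejoin the surplus pieces of column c with "\|"
# instead of dropping the pipes. The expected field count comes from
# the header (line 1).
escape_pipes() {
  awk -v c="${1:-9}" '
    BEGIN { FS = "|" }
    NR == 1 { want = NF; print; next }
    NF > want {
      extra = NF - want
      line = ""
      for (i = 1; i <= NF; i++) {
        # Pieces c+1 .. c+extra belong to column c, so the pipe in
        # front of each of them is escaped; all others stay as they are.
        if (i == 1)                       sep = ""
        else if (i > c && i <= c + extra) sep = "\\|"
        else                              sep = "|"
        line = line sep $i
      }
      print line
      next
    }
    { print }
  '
}

# Made-up sample with the free-text column in position 2.
printf '%s\n' 'a|b|c' '1|x|y|3' | escape_pipes 2
```

Keeping the pipes in escaped form preserves the user's text, at the price that every consumer of the file has to understand the \| convention.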
echo '
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here | pipes | not needed | Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098' |
mawk NF=NF FS=' [|] ' OFS=' '
IDS|STG |STT|WRKLST |AR|CD |DT |INDX|EVNT_MSSG |EVNT_SRC|EVNT_TM |TYP|DATE |USR_ID|IDS_APP
1 |ENRICH|Inc|complete|14|BM404|202302|15 |This is some text |Operator|10:33:13|0 |20230220|admin |3177098
2 |ENRICH|Inc|complete|15|BM501|202302|16 |This is some more |Operator|10:33:13|0 |20230220|admin |3177098
3 |ENRICH|Inc|complete|16|BM502|202302|17 |This bit is all good |Operator|10:33:13|0 |20230220|admin |3177098
4 |ENRICH|Inc|complete|17|BM551|202302|18 |Yet more text |Operator|10:33:13|0 |20230220|admin |3177098
5 |ENRICH|Inc|complete|18|EM002|202302|19 |problem here pipes not needed Call |Operator|10:33:14|0 |20230220|admin |3177098
6 |ENRICH|Inc|complete|19|BM451|202302|20 |This is also fine |Operator|10:33:14|0 |20230220|admin |3177098
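As long as you can assume that all well-formed fields are left-aligned, this works even for variable-width fields/columns. It also relies on each stray pipe having a space on both sides, as in the sample, because that is what FS matches.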
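A slightly more explicit spelling of the same idea, as a sketch for awks other than mawk: FS and OFS are set in a BEGIN block, and the record is rebuilt with $1 = $1 rather than NF=NF. The function name strip_spaced_pipes and the short sample are made up for illustration.

```shell
# Hypothetical helper: treat " | " (space, pipe, space) as the only
# separator and rejoin the pieces with a single space, so pipes that
# directly precede left-aligned field text are left untouched.
strip_spaced_pipes() {
  awk 'BEGIN { FS = " [|] "; OFS = " " } { $1 = $1 } 1'
}

# Made-up sample: the second line has one stray pipe inside the text.
printf '%s\n' 'ID|MSG |SRC' '1 |a | b |op' | strip_spaced_pipes
```

One difference from the NF=NF form above: that version uses NF=NF as the pattern and so skips empty lines, while this sketch prints them unchanged.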