awk - numbering each occurrence of a match does not work correctly

Question Votes: 0 Answers: 3

I am trying to highlight every occurrence of a matching word using the awk command below.

Input:

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Command:

python -c 'import this' |  awk -v t="better" ' { gsub(t,"[" "&-" ++a["&"] "]" ,$0); print } ' 

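But it looks like gsub() is not working correctly. The input contains 8 matches of "better" in total, yet for the last one the command above prints "better-18". How can this be fixed?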

Although never is often [better-18] than *right* now.. # wrong should be 8

Expected output:

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

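I'd like a scalable solution that can accommodate more words, i.e. words read from another text file, with just a single scan of the input file. The "words" should be exact matches, not partial/substring matches.

match.txt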

better
idea
unix awk
3 Answers
1
vote

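You can't do this:

gsub(t,"[" "&-" ++a["&"] "]" ,$0)

because it would do a partial regexp match rather than a full string match and, more importantly, the arguments to gsub() are evaluated before gsub() is called, so ++a["&"] increments the value of a[] indexed by the literal string "&" once before gsub() is called, i.e. once per input line whether or not that line matches. That is why the match on input line 18 was numbered 18. It is exactly the same as if you had written: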

foo=(++a["&"]); gsub(t,"[" "&-" foo "]" ,$0)
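To make that evaluation order visible, here is a minimal sketch. The three-line sample input and the function name demo_eval_order are made up for illustration. The counter is bumped once for every input line, matched or not, so every match on a line gets that line's number:

```shell
# Shows that ++a["&"] is evaluated once per record, before gsub() runs.
# demo_eval_order and its sample input are illustrative only.
demo_eval_order() {
    printf '%s\n' 'no match here' 'better and better' 'still better' |
    awk -v t="better" '{ gsub(t, "[" "&-" ++a["&"] "]", $0); print }'
}
demo_eval_order
# prints:
#   no match here
#   [better-2] and [better-2]
#   still [better-3]
```

Both matches on the second line get 2 (the line number), not 1 and 2, because the replacement string was already built before gsub() looked at the line.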

This may be what you're trying to do, using GNU awk for patsplit():

$ cat tst.awk
NR==FNR {
    words[$1]
    next
}
{
    n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
    out = seps[0]
    for (i=1; i<=n; i++) {
        word = flds[i]
        if ( word in words ) {
             word = "[" word "-" (++cnt[word]) "]"
        }
        out = out word seps[i]
    }
    print out
}

$ awk -f tst.awk match.txt file
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad [idea-1].
If the implementation is easy to explain, it may be a good [idea-2].
Namespaces are one honking great [idea-3] -- let's do more of those!

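The above assumes your definition of a "word" is any sequence of alphanumeric or underscore (i.e. "word constituent") characters. If it isn't, just change [[:alnum:]_]+ to whatever matches your definition of a "word".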

You could do the same in any POSIX awk with a while (match(..,/[[:alnum:]_]+/)) substr(... loop; that is left as an exercise if you want it.
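For reference, one way that match()/substr() loop might look is sketched below. This is not the answerer's code: the wrapper name number_words and its two arguments (word list, input file) are assumptions for illustration, and it assumes an awk that understands POSIX character classes such as [[:alnum:]].

```shell
# POSIX awk sketch: walk each line with match() and substr() instead of patsplit().
# number_words is a hypothetical helper; $1 = word list file, $2 = input file.
number_words() {
    awk '
    NR == FNR { words[$1]; next }      # first file: remember each target word
    {
        rest = $0
        out = ""
        # Repeatedly find the next run of word characters in the unprocessed text
        while ( match(rest, /[[:alnum:]_]+/) ) {
            word = substr(rest, RSTART, RLENGTH)
            if ( word in words ) {
                word = "[" word "-" (++cnt[word]) "]"
            }
            # Copy the separator before the word, then the (possibly tagged) word
            out = out substr(rest, 1, RSTART - 1) word
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest                 # anything left over is a trailing separator
    }' "$1" "$2"
}
```

It would be called as number_words match.txt file. Because whole runs of word characters are compared against the list, a line such as "better than unbetter" only has its first word tagged.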

If you want to handle a single word passed in as a variable assignment, then:

$ cat tst.awk
{
    n = patsplit($0,flds,/[[:alnum:]_]+/,seps)
    out = seps[0]
    for (i=1; i<=n; i++) {
        word = flds[i]
        if ( word == t ) {
             word = "[" word "-" (++cnt) "]"
        }
        out = out word seps[i]
    }
    print out
}

$ awk -v t='better' -f tst.awk file
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

0
votes

For multiple match words, Perl can also be used:

$ perl -pe ' BEGIN { @m= map {chomp;$_} qx(cat match.txt); $t=join("|",@m) }
s/\b(?:$t)\b/$kv{$&}++;"[$&-$kv{$&}]"/ge ' input.txt
The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad [idea-1].
If the implementation is easy to explain, it may be a good [idea-2].
Namespaces are one honking great [idea-3] -- let's do more of those!

0
votes

If you don't mind a gawk-specific solution without any loops, function calls, or an END { } block:


echo "${input}" |

gawk -e 'BEGIN { _ ^= __ = "[" (RS = "better") "-"
                     ___ = "]" } ! (ORS = RT) || ORS = (__ _++)___'

The Zen of Python, by Tim Peters

Beautiful is [better-1] than ugly.
Explicit is [better-2] than implicit.
Simple is [better-3] than complex.
Complex is [better-4] than complicated.
Flat is [better-5] than nested.
Sparse is [better-6] than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is [better-7] than never.
Although never is often [better-8] than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!    