Regular expression for strings near other strings?

Question (votes: 0, answers: 2)

I want to write a flexible regular expression for grep that returns search terms located within a certain distance of each other.

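The ideal behavior would be like that of a research database: for example, you can search for articles in which capital and GDP occur within 15 words of each other, which includes articles where the strings capital and GDP are separated by five, six, seven, etc. alphanumeric strings of unspecified length. The regex should allow for punctuation (e.g., commas, periods, hyphens) as well as accented characters and diacritics, so that it would also find chechè and lavi within no more than five strings of each other.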

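I imagine the expression would involve lookaheads and something like {1,15}, or perhaps piping one grep into another, but that would lose the benefit of GREP_OPTIONS='--color=auto'. Constructing it is really beyond my skills. I have a set of .txt documents that I want to search, and a regex that makes it easy to change the distance between the strings, or to truncate the terms, would also be useful to anyone who keeps things like field notes or reading notes in a standard format.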

Edit

Here is a sample of passages taken from the Bible.

Ye shall buy meat of them for money, that ye may eat; and ye shall also buy water of them for money, that ye may drink. For the Lord thy God hath blessed thee in all the works of thy hand: he knoweth thy walking through this great wilderness: these forty years the Lord thy God hath been with thee; thou hast lacked nothing... Thou shalt sell me meat for money, that I may eat; and give me water for money, that I may drink: only I will pass through on my feet: (as the children of Esau which dwell in Seir, and the Moabites which dwell in Ar, did unto me:) until I shall pass over Jordan into the land which the Lord our God giveth us. But Sihon king of Heshbon would not let us pass by him: for the Lord thy God hardened his spirit, and made his heart obstinate, that he might deliver him into thy hand, as appeareth this day. And the Lord said unto me, Behold, I have begun to give Sihon and his land before thee: begin to possess, that thou mayest inherit his land. Then Sihon came out against us, he and all his people, to fight at Jahaz. And the Lord our God delivered him before us; and we smote him, and his sons, and all his people. And if the way be too long for thee, so that thou art not able to carry it; or if the place be too far from thee, which the Lord thy God shall choose to set his name there, when the Lord thy God hath blessed thee: then shalt thou turn it into money, and bind up the money in thine hand, and shalt go unto the place which the Lord thy God shall choose: and thou shalt bestow that money for whatsoever thy soul lusteth after, for oxen, or for sheep, or for wine, or for strong drink, or for whatsoever thy soul desireth: and thou shalt eat there before the Lord thy God, and thou shalt rejoice, thou, and thine household, and the Levite that is within thy gates; thou shalt not forsake him: for he hath no part nor inheritance with thee... 
Now it came to pass, that at what time the chest was brought unto the king’s office by the hand of the Levites, and when they saw that there was much money, the king’s scribe and the high priest’s officer came and emptied the chest, and took it, and carried it to his place again. Thus they did day by day, and gathered money in abundance. And when they had finished it, they brought the rest of the money before the king and Jehoiada, whereof were made vessels for the house of the Lord , even vessels to minister, and to offer withal, and spoons, and vessels of gold and silver. And they offered burnt offerings in the house of the Lord continually all the days of Jehoiada. Thou hast bought me no sweet cane with money, neither hast thou filled me with the fat of thy sacrifices; but thou hast made me to serve with thy sins, thou hast wearied me with thine iniquities... Howbeit there were not made for the house of the Lord bowls of silver, snuffers, basins, trumpets, any vessels of gold, or vessels of silver, of the money that was brought into the house of the Lord: but they gave that to the workmen, and repaired therewith the house of the Lord. Moreover they reckoned not with the men, into whose hand they delivered the money to be bestowed on workmen: for they dealt faithfully. The trespass money and sin money was not brought into the house of the Lord: it was the priests’.

If I wanted to find the cases where shalt and money occur within five words of each other (punctuation included), how would I write that regex?

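I'm not sure how to show the expected result, because grep --context=1 returns more than just the 0-5 strings in between, but I think the result would identify: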

shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money

But it would not return shall buy meat of them for money, because 'money' appears as the sixth string.

regex shell awk grep pcre
2 Answers
1
votes

OK, it's not grep, but this seems to be what you're asking for, using GNU awk for multi-char RS and word boundaries:

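$ cat tst.awk
BEGIN {
    RS="^$"              # gawk: never matches non-empty input, so the whole file is one record
    split(words,word)    # word[1] and word[2] are the two search terms
}
{
    # Escape any literal @, { and } so that { and } are free to use as markers.
    gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")
    # Replace each whole-word occurrence of the search terms with a one-character marker.
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    # Find each stretch that runs from one marker to the other with no marker in between.
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt =  substr($0,RSTART,RLENGTH)
        # Restore the search terms and the escaped characters in the candidate text.
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/@C/,"}",tgt); gsub(/@B/,"{",tgt); gsub(/@A/,"@",tgt)
        # gsub() returns the number of whitespace runs, i.e. the gaps between words.
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print tgt
        }
        # Drop the text up to just past the start of this match and search again.
        $0 = substr($0,RSTART+length(word[1]))
    }
}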

$ awk -v words='money shalt' -v range=5 -f tst.awk file
shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money

$ awk -v words='and him' -v range=10 -f tst.awk file
him: for the Lord thy God hardened his spirit, and
and made his heart obstinate, that he might deliver him
him before us; and
and we smote him
him, and

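Note that the above works even with input like shalt sell me meat for money in thine hand, and shalt, where one word (money) appears within 5 words after the first occurrence of the other word (shalt) and also within 5 words before the second occurrence of that word (shalt):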

$  echo 'shalt sell me meat for money in thine hand, and shalt' |
    awk -v words='shalt money' -v range=5 -f tst.awk
shalt sell me meat for money
money in thine hand, and shalt
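
The range test in tst.awk depends on a detail worth spelling out: gsub() returns the number of substitutions it made, so replacing every run of whitespace with itself ("&") counts the gaps between words without changing the text. A minimal sketch of just that step, assuming any POSIX awk; the function name count_gaps is made up for illustration and is not part of the answer's script:

```shell
# count_gaps: print the number of whitespace runs on each input line.
# Hypothetical helper that isolates the counting idiom used in tst.awk.
count_gaps() {
    awk '{ n = gsub(/[[:space:]]+/, "&"); print n }'
}

echo 'shalt sell me meat for money' | count_gaps   # prints 5
```

So range=5 allows at most five gaps between the two search words, which means at most four intervening words. The phrase shall buy meat of them for money has six gaps and would be rejected on distance alone.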

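For colors, file names and line numbers:

Run this to see which colors are available in your terminal (each line will be output in a different color):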

$ for ((c=0; c<$(tput colors); c++)); do tput setaf "$c"; tput setaf "$c" | cat -v; echo "=$c"; done; tput setaf 0
^[[30m=0
^[[31m=1
^[[32m=2
^[[33m=3
^[[34m=4
^[[35m=5
^[[36m=6
^[[37m=7

Now that you can see what those escape sequences and numbers mean, update the awk script to (\033 = ^[ = Esc):

$ cat tst.awk
BEGIN {
    RS="^$"
    split(words,word)
    c["black"]  = "\033[30m"
    c["red"]    = "\033[31m"
    c["green"]  = "\033[32m"
    c["yellow"] = "\033[33m"
    c["blue"]   = "\033[34m"
    c["pink"]   = "\033[35m"
    c["teal"]   = "\033[36m"
    c["grey"]   = "\033[37m"
    for (color in c) {
        print c[color] color c["black"]
    }
}
{
    gsub(/@/,"@A"); gsub(/{/,"@B"); gsub(/}/,"@C")
    gsub("\\<"word[1]"\\>","{")
    gsub("\\<"word[2]"\\>","}")
    while ( match($0,/{[^{}]+}|}[^{}]+{/) ) {
        tgt =  substr($0,RSTART,RLENGTH)
        gsub(/}/,word[2],tgt)
        gsub(/{/,word[1],tgt)
        gsub(/@C/,"}",tgt); gsub(/@B/,"{",tgt); gsub(/@A/,"@",tgt)
        if ( gsub(/[[:space:]]+/,"&",tgt) <= range ) {
            print FILENAME, FNR, c["red"] tgt c["black"]
        }
        $0 = substr($0,RSTART+length(word[1]))
    }
}

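When you run it, you will see a dump of all available colors, and each target text is printed in red, prefixed by the file name and the value of FNR. Because RS="^$" makes the whole file a single record, FNR is a record number here and will be 1 for every match rather than a true line number.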
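
One detail in the script above: it ends each colored string by switching to black (\033[30m), which assumes black is the terminal's normal text color. A hedged alternative is the standard reset sequence \033[0m, which restores the terminal's default attributes whatever they are. A minimal sketch, assuming a terminal that understands ANSI escape sequences; the function name red is hypothetical:

```shell
# red: print the argument in red, then reset all attributes with ESC[0m
# instead of assuming that black is the terminal's default text color.
red() {
    awk -v text="$1" 'BEGIN { printf "\033[31m%s\033[0m\n", text }'
}

red 'shalt bestow that money'
```

In tst.awk the same effect could be had by adding an entry such as c["reset"] = "\033[0m" and using it where c["black"] is used to end a colored string.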



0
votes

Short answer: grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money'

Or, to match in both directions: grep 'shalt\W\+\(\w\+\W\+\)\{0,5\}money\|money\W\+\(\w\+\W\+\)\{0,5\}shalt'
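
To see exactly which span the pattern covers, grep's -o option prints only the matched text. The sketch below uses the -E (extended) spelling of the same one-direction pattern, which drops the backslashes before +, ( and {. Since \w and \W are GNU extensions, it assumes GNU grep; the function name near is hypothetical:

```shell
# near: print each span where "shalt" is followed by "money" with at most
# five words in between. Reads standard input; assumes GNU grep (\w, \W, -o).
near() {
    grep -oE 'shalt\W+(\w+\W+){0,5}money'
}

echo 'Thou shalt sell me meat for money, that I may eat' | near
# prints: shalt sell me meat for money
```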

https://www.gnu.org/software/grep/manual/grep.html

'\w'

Matches word constituent, it is a synonym for '[_[:alnum:]]'.

'\W'

Matches non-word constituent, it is a synonym for '[^_[:alnum:]]'.
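
Because the distance is counted in \w+ runs, anything that is not a letter, digit or underscore acts as a separator. A hyphenated term therefore counts as two words, which matters when choosing the distance. A quick way to check how a phrase is split, again assuming GNU grep; the function name count_words is hypothetical:

```shell
# count_words: count the \w+ runs that grep sees on standard input.
count_words() {
    grep -oE '\w+' | wc -l | tr -d ' '
}

echo 'well-known, thing' | count_words   # prints 3: well, known, thing
```

Accented letters are matched by \w only if the current locale classifies them as alphanumeric, so results may differ between a UTF-8 locale and LC_ALL=C.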

A more general answer that builds the grep pattern dynamically, using a shell function in this case:

find_adjacent() {
    dist="$1"; shift
    grep1="$1"; shift
    grep2="$1"; shift

    between='\W\+\(\w\+\W\+\)\{0,'"$dist"'\}'
    regex="$grep1$between$grep2\|$grep2$between$grep1"

    printf 'Using the regex: %s\n' "$regex" 1>&2
    grep "$regex" "$@"
}

Usage example:

echo 'shalt sell me meat for money
shalt thou turn it into money
money in thine hand, and shalt
shalt bestow that money
capital and GDP' | find_adjacent 3 shalt money -i --color=auto

Or, to match across lines:

find_adjacent 5 shalt money -z file_with_the_bible_passages.txt

Edit

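As pointed out by EdMorton, this only finds the first part of a continued match. It will still match the correct lines, but the color highlighting will be a bit off.

To fix that, the regex gets more complicated, because it has to match any of the 4 cases of a continued "shalt ... money ... shalt" chain: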

  • "shalt ... money ... shalt ..."
  • "shalt ... money ... shalt ... money ..."
  • "money ... shalt ... money ..."
  • "money ... shalt ... money ... shalt ..."

This can be done by replacing the regex=... line with:

regex1="$grep1\($between$grep2$between$grep1\)\+"
regex2="$grep1$between$grep2\($between$grep1$between$grep2\)*"
regex3="$grep2\($between$grep1$between$grep2\)\+"
regex4="$grep2$between$grep1\($between$grep2$between$grep1\)*"
regex="$regex1\|$regex2\|$regex3\|$regex4"

It can also get confused by input like this: "shalt xxx shalt xxx money xxx money"

With a distance of at most 3 words in between, the above regex will still only find: "shalt xxx shalt xxx money"
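
This behaviour is easy to reproduce with -o, because grep reports non-overlapping matches and resumes searching after the end of the previous one. The sketch below uses the simpler two-direction pattern in -E form with a distance of 3, not the four-case regex, and assumes GNU grep; the function name show_spans is hypothetical:

```shell
# show_spans: print every non-overlapping span matched by the two-direction
# pattern with at most three words between "shalt" and "money".
show_spans() {
    grep -oE 'shalt\W+(\w+\W+){0,3}money|money\W+(\w+\W+){0,3}shalt'
}

echo 'shalt xxx shalt xxx money xxx money' | show_spans
# prints: shalt xxx shalt xxx money
```

Only one span is printed, although the second shalt and the last money are also within three words of each other: the first match has already consumed that shalt.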

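To solve these problems, the only viable solution is to match just the words themselves and use lookahead/lookbehind (this requires a more advanced regex implementation, e.g. GNU grep's -P for Perl regular expressions):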

find_adjacent() {
    dist="$1"; shift
    word1="$1"; shift
    word2="$1"; shift

    ahead='\W+(\w+\W+){0,'"$dist"'}'
    behind='(\W+\w+){0,'"$dist"'}\W+'
    regex="$word1(?=$ahead$word2)|(?<=$word2)$behind\K$word1|$word2(?=$ahead$word1)|(?<=$word1)$behind\K$word2"

    printf 'Using the regex: %s\n' "$regex" 1>&2
    grep -P "$regex" "$@"
}

Another usage example (case-insensitive search, show file name and line number, highlight the words found, search all files in a directory):

find_adjacent 15 capital GDP -i -Hn --color=auto -r folder_to_search