How to do an accent-insensitive grep?

Question

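Is there a way to do an accent-insensitive search using grep, preferably keeping the --color option? By this I mean that grep --secret-accent-insensitive-option aei would match àei but also äēì and possibly æi.

I know I can use iconv -t ASCII//TRANSLIT to remove accents from a text, but I don't see how to use it for matching, since the text gets transformed (it would work for grep -c or -l).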

Answer 1

You are looking for a whole bunch of POSIX regular expression equivalence classes:

14.3.6.2 Equivalence Class Operators ([= … =])

Regex recognizes equivalence class expressions inside lists. A equivalence class expression is a set of collating elements which all belong to the same equivalence class. You form an equivalence class expression by putting a collating element between an open-equivalence-class operator and a close-equivalence-class operator. [= represents the open-equivalence-class operator and =] represents the close-equivalence-class operator. For example, if a and A were an equivalence class, then both [[=a=]] and [[=A=]] would match both a and A. If the collating element in an equivalence class expression isn’t part of an equivalence class, then the matcher considers the equivalence class expression to be a collating symbol.

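I'm using carets on the line after the output to indicate what actually gets colored. I also tweaked the test string to illustrate a point about case.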

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=]][[=e=]][[=i=]]'
I match àei but also äēì and possibly æi
        ^^^          ^^^

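This matches all words like aei. The fact that it didn't match æi should be a reminder that you are beholden to whatever mappings exist in the regex library you're using (presumably gnulib, which is what I linked to and quoted from), though I think it's quite likely that digraphs are beyond the reach of even the best equivalence class map.

You should not expect equivalence classes to be portable, since they are too arcane.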
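To make the idea of an equivalence class a little more concrete, here is a minimal Python sketch that approximates one from Unicode data instead of from the C library's locale tables. Treat it as an illustration only: it assumes that canonical decomposition (NFD) is a fair stand-in, whereas real `[[=a=]]` membership depends on your locale and regex library. The helper name `equivalents` and the `limit` parameter are hypothetical.

```python
import unicodedata

def equivalents(base, limit=0x250):
    """Return characters below `limit` whose canonical decomposition starts with `base`.

    This only approximates a POSIX equivalence class; it is not what glibc or gnulib use.
    """
    found = []
    for codepoint in range(limit):
        char = chr(codepoint)
        decomposed = unicodedata.normalize("NFD", char)
        # Compare case-insensitively so that "A" and "À" are grouped with "a".
        if decomposed[0].lower() == base.lower():
            found.append(char)
    return "".join(found)

if __name__ == "__main__":
    print(equivalents("a"))
```

In this approximation æ is absent from the class for a, because it has no canonical decomposition into a base letter plus marks. That mirrors the grep result above.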

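Taking it a step further, if you want ONLY accented characters, things get a lot more complicated. Here, I've changed your request for aei into [aei].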

$ echo "I match àei but also äēì and possibly æi" | grep '[[=a=][=e=][=i=]]'
I match àei but also äēì and possibly æi
^  ^    ^^^     ^    ^^^ ^       ^     ^

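Cleaning this up to avoid the non-accented matches requires both equivalence classes and look-ahead/look-behind. BRE (POSIX basic regular expressions) and ERE (POSIX extended regular expressions) support the former but lack the latter. Libpcre (the C library for perl-compatible regular expressions used by grep -P and most others) and perl support the latter but lack the former:

Try #1: grep with libpcre: fail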

$ echo "I match àei but also äēì and possibly æi" \
    | grep -P '[[=a=][=e=][=i=]](?<![aei])'
grep: POSIX collating elements are not supported

Try #2: perl itself: fail

$ echo "I match àei but also äēì and possibly æi" \
    | perl -ne 'print if /[[=a=][=e=][=i=]](?<![aei])/'
POSIX syntax [= =] is reserved for future extensions in regex; marked by <-- HERE in m/[[=a=][=e= <-- HERE ][=i=]](?<![aei])/ at -e line 1.

Try #3: python (which has its own PCRE implementation): (silent) fail

$ echo "I match àei but also äēì and possibly æi" \
    | python -c 'import re, sys;
                 print(re.findall(r"[[=a=][=e=][=i=]]", sys.stdin.read()))'
[]

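Wow, a regex feature that PCRE, python, and even perl don't support! There aren't many of those. (Never mind that the complaint points at the second equivalence class; perl still complains when given just /[[=a=]]/.) This is further evidence that equivalence classes are arcane.

Indeed, it appears that no PCRE library is capable of equivalence classes; the section on equivalence classes at regular-expressions.info claims that only regex libraries implementing the POSIX standard actually have this support. GNU grep comes closest since it can do BRE, ERE, and PCRE, but it can't combine them.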
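The silent failure in try #3 is easier to understand after checking what Python's `re` module makes of the pattern. As far as I can tell, it reads `[[=a=]` as an ordinary set containing `[`, `=` and `a`, so the whole expression only matches odd four-character strings such as `a=i]`. A small sketch (the function name `find_all` is made up for illustration; the warning filter is there because recent Python versions emit a FutureWarning about a possible nested set):

```python
import re
import warnings

PATTERN = r"[[=a=][=e=][=i=]]"

def find_all(text):
    """Run the POSIX-style pattern through Python's re and return the matches."""
    with warnings.catch_warnings():
        # Newer Pythons warn about "[[" because nested sets may gain a meaning later.
        warnings.simplefilter("ignore", FutureWarning)
        return re.findall(PATTERN, text)

if __name__ == "__main__":
    # No equivalence classes are applied to the accented sample text.
    print(find_all("I match àei but also äēì and possibly æi"))
    # Three plain sets followed by a literal "]" do match this string.
    print(find_all("a=i]"))
```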

So we'll do it in two parts.

Try #4: disgusting trickery: success

$ echo "I match àei but also äēì and possibly æi" \
    | grep --color=always '[[=a=][=e=][=i=]]' \
    | perl -pne "s/\e\[[0-9;]*m\e\[K(?i)([aei])/\$1/g"
I match àei but also äēì and possibly æi
        ^            ^^^

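Code walk:

grep forces color on so that perl can key on the color codes to note the matches
${GREP_COLOR:-01;31} notes grep's color (defaulting to the same bright red)
perl's s/// command matches the full color code and then the non-accented letter that we want to remove from the final results. It replaces all of that with the (uncolored) letter
(?i) makes everything after it in the perl regex case-insensitive, since [[=i=]] matches I
perl -p prints each line of its input upon completion of its -e execution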
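For readers more at home in Python, the perl substitution from try #4 can be sketched roughly as follows. This is a port under assumptions, not part of the original pipeline: the function name `uncolor_plain_letters` is made up, and the exact shape of grep's color codes (`\x1b[01;31m\x1b[K` to start, `\x1b[m\x1b[K` to end) is assumed from the perl regex rather than verified against every grep version.

```python
import re

# An SGR color start such as "\x1b[01;31m" followed by "\x1b[K", then one plain letter.
# (?i:...) keeps case-insensitivity local to the letter, much like (?i) in the perl regex.
PLAIN_HIT = re.compile(r"\x1b\[[0-9;]*m\x1b\[K(?i:([aei]))")

def uncolor_plain_letters(line):
    """Drop the color start in front of unaccented a/e/i, keeping the letter itself."""
    return PLAIN_HIT.sub(r"\1", line)

if __name__ == "__main__":
    start, end = "\x1b[01;31m\x1b[K", "\x1b[m\x1b[K"  # assumed shape of grep's codes
    colored = f"{start}à{end}{start}e{end}{start}i{end}"
    print(repr(uncolor_plain_letters(colored)))
```

As in the perl version, the reset sequence after each uncolored letter is left in place, which should be harmless on a terminal.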

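For more on BRE vs ERE vs PCRE and others, see this StackExchange regex post or POSIX regexps at regular-expressions.info. For more on per-language differences (including libpcre vs python PCRE vs perl), look at tools at regular-expressions.info.

2019 update: GNU grep now uses $GREP_COLORS, which looks like ms=1;41, in preference to the older $GREP_COLOR, which looks like 1;41. This is harder to extract (and it is hard to juggle the two), so I changed the perl code in try #4 to look for any SGR color code instead of keying on only the color that grep would add. See revision 2 of this answer for the previous code.

I cannot currently verify whether the BSD grep used by Apple Mac OS X supports POSIX regex equivalence classes.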
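Where equivalence-class support is in doubt, the accented-only search can be approximated outside grep entirely. The sketch below swaps in a different technique, Unicode canonical decomposition (NFD), in place of equivalence classes plus look-behind. It assumes the input uses precomposed characters, and the function name `accented_hits` is hypothetical.

```python
import unicodedata

def accented_hits(text, letters="aei"):
    """Return (index, char) pairs for accented variants of `letters`, skipping plain ones."""
    hits = []
    for index, char in enumerate(text):
        decomposed = unicodedata.normalize("NFD", char)
        # An accented letter decomposes into a base letter plus one or more combining marks.
        if len(decomposed) > 1 and decomposed[0].lower() in letters:
            hits.append((index, char))
    return hits

if __name__ == "__main__":
    sample = "I match àei but also äēì and possibly æi"
    for index, char in accented_hits(sample):
        print(index, char)
```

The plain letters and æ are skipped because they do not decompose into a base letter plus marks; coloring the reported positions is left out to keep the sketch short.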

Answer 2

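I don't think this can be done in grep, unless you're willing to write a shell script that uses iconv and diff, which would be a little different from what you asked for.

Here is something very close to your request, via a quick perl script: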

#!/usr/bin/perl
# tgrep 0.1 Copyright 2014 by Adam Katz, GPL version 2 or later

use strict;
use warnings;
use open qw(:std :utf8);
use Text::Unidecode;

my $regex = shift or die "Missing pattern.\nUsage: tgrep PATTERN [FILE...]\n";

my $retval = 1;  # default to false (no hits)

while(<>) {
  my ($line, $hit) = ("", 0);
  while(/\G(\S*(?:\s+|$))/g){             # for each word (w/ trailing spaces)
    my $word = $1;
    if(unidecode($word) =~ qr/$regex/) {  # if there was a match
      $hit++;                             # note that fact
      $retval = 0;                        # final exit code will be 0 (true)
      $line .= "\e[1;31m$word\e[0;0m";    # display word in RED
    } else {
      $line .= $word;                     # display non-matching word normally
    }
  }
  print $line if $hit;                    # only display lines with matches
}

exit $retval;

Markdown won't let me make red text, so in the output the hits are marked with quotes.

This highlights the matching word rather than the actual match, which would be very hard to do without building lots of character classes and/or composing a piecemeal regex parser. Therefore, searching for the pattern "ae" rather than "aei" would yield the same result (in this case).

None of grep's flags are replicated in this toy example. I wanted to keep it simple.

Answer 3

For me, calling grep from php (this can be adapted) is really faster than the perl solution.

Strtolower your query string without accents, then replace some letters with their accented forms, and use grep -i for a case-insensitive search (note the quotes around $q).
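The query-expansion idea from this answer (lowercase the query, then swap letters for their accented alternatives) can be sketched in Python roughly like this. It is not the answer's php code: the `VARIANTS` table and the `expand_query` name are hypothetical, and the table is deliberately short.

```python
import re

# Hypothetical, deliberately short table; extend it for the letters you care about.
VARIANTS = {
    "a": "aàáâãäåā",
    "e": "eèéêëē",
    "i": "iìíîïī",
    "o": "oòóôõöō",
    "u": "uùúûüū",
}

def expand_query(query):
    """Turn an unaccented query into a pattern with one bracket class per known letter."""
    parts = []
    for char in query.lower():
        if char in VARIANTS:
            parts.append("[" + VARIANTS[char] + "]")
        else:
            # Escape anything else so it is matched literally.
            parts.append(re.escape(char))
    return "".join(parts)

if __name__ == "__main__":
    pattern = expand_query("aei")
    print(pattern)
    sample = "I match àei but also äēì and possibly æi"
    print(re.findall(pattern, sample, flags=re.IGNORECASE))
```

For a query made only of letters, the resulting pattern is plain bracket expressions, so it should also be usable with grep -i in a UTF-8 locale, provided it is quoted in the shell. I have not checked that against every grep.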

Question details: 9 votes, 3 answers
