Regex to add a comma after abbreviations

Question (votes: 0, answers: 1)

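I want to add a comma followed by a space (", ") after abbreviations, where an abbreviation is defined as one or more letters followed by a dot, with that letters-plus-dot pattern repeated two or more times. For example, these are considered abbreviations: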
A.b.C.
a.b.
ab.cd.
ab.cde.
ab.cd.ef.gh.
whereas these are not abbreviations:
a.b
A. B
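
To make the definition concrete, here is a minimal sketch that checks the examples above against one possible reading of it. The pattern and the helper name is_abbreviation are assumptions made for illustration; they are not part of any library.

```python
import re

# Assumed reading of the definition: one or more letters followed by a dot,
# with that letters-plus-dot group occurring at least twice in a row.
ABBREVIATION = re.compile(r"(?:[A-Za-z]+\.){2,}")


def is_abbreviation(token: str) -> bool:
    """Return True if the whole token fits the assumed definition."""
    return ABBREVIATION.fullmatch(token) is not None


examples = ["A.b.C.", "a.b.", "ab.cd.", "ab.cde.", "ab.cd.ef.gh.", "a.b", "A. B"]
for token in examples:
    print(f"{token!r}: {is_abbreviation(token)}")
```

Under this reading the first five tokens are accepted and the last two are rejected, because "a.b" lacks the closing dot and "A. B" contains a space.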
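I don't want to add a comma:

  • if the last dot of the abbreviation is the end of the given text,
  • if the abbreviation is followed by optional whitespace and a capital letter, or
  • if the abbreviation is followed by optional whitespace and another punctuation mark.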

Given the following test sentences:

test_str = """This is an example e.g. sentence and this is with i.e. text and two abbreviations S.T.R. and K.LM.NO.P. as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g. book 1 or i.e. book2.
            A.B.C. is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d. is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

I expect the following output:

output_text = """This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P., as example with acronym.
            but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
            And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
            This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2.
            A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation. 
            A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation. 
            A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
            a.b.c.d., is an abbreviation that should match.
            a.b.c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
            Another abbreviation that should not match j.j.L.o.U.h."""

What I am currently using is:

regex = r"(\b(?:[A-Za-z]\.){2,}(?!\s*[,.;?!-]))"

but it produces the following output:

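This is an example e.g., sentence and this is with i.e., text and two abbreviations S.T.R., and K.LM.NO.P. as example with acronym.
but in here it shouldn't catch it because after that there is space and dot g.k. . Also here it shouldn't detect because the next sentence starts with capital A.BC.D.
And this is a normal sentence. Followed by another normal sentence. This contains only one letter A. and is not abbreviation.
This shouldn't match i.e., since it contains already a comma. I like to read books such as e.g., book 1 or i.e., book2.
A.B.C., is an abbreviation that should match. A.B.! is an abbreviation that shouldn't match because it has ! after the abbreviation.
A.B.? is an abbreviation that shouldn't match because it has ? after the abbreviation.
A.B. ; is an abbreviation that shouldn't match because it has a space and ; after the abbreviation.
a.b.c.d., is an abbreviation that should match.
a.b., c., is an abbreviation that shouldn't match because it already has a comma. A.B is not an abbreviation because it contains only one dot.
Another abbreviation that should not match j.j.L.o.U.h.,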

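The cases where my regex fails were shown in bold in the output above: "K.LM.NO.P." gets no comma, "a.b.c.," becomes "a.b., c.,", and the final "j.j.L.o.U.h." gets a trailing comma. They should be

K.LM.NO.P.,
a.b.c.,
j.j.L.o.U.h.

because the first should be detected as an abbreviation, the second already has a punctuation mark after its last dot, and the last is at the end of the given text.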
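
The three failures can be reproduced in isolation with a small sketch. It lists what the pattern from the question matches rather than performing the substitution, so it does not depend on the replacement string, which the question does not show. The helper name matched_abbreviations is hypothetical.

```python
import re

# The pattern from the question, unchanged.
regex = r"(\b(?:[A-Za-z]\.){2,}(?!\s*[,.;?!-]))"


def matched_abbreviations(text: str) -> list[str]:
    """Return every substring the question's pattern would append a comma to."""
    return [m.group(0) for m in re.finditer(regex, text)]


# Multi-letter segments: [A-Za-z]\. allows only one letter per segment.
print(matched_abbreviations("K.LM.NO.P. as example"))

# Backtracking: the engine gives up the trailing "c." so the lookahead sees "c".
print(matched_abbreviations("a.b.c., is an abbreviation"))

# End of text: the negative lookahead has nothing to reject there.
print(matched_abbreviations("should not match j.j.L.o.U.h."))
```

The first call finds nothing, the second finds only "a.b.", and the third finds the whole of "j.j.L.o.U.h.", which corresponds to the three unwanted results.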

Is there a way to achieve this? Any help is much appreciated!

python python-3.x regex regex-group
1 Answer
Votes: 1

You can use this regex for matching:

(?<=\.[a-zA-Z])\.(?=\s[a-z])

and replace each match with the string

.,
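
In Python, the match-and-replace step might look like the following sketch. The pattern and the replacement string are the ones given above; the function name add_commas and the sample sentence are made up for illustration.

```python
import re

# Pattern from the answer: a dot preceded by "<dot><letter>" and followed by
# one whitespace character and a lowercase letter.
pattern = re.compile(r"(?<=\.[a-zA-Z])\.(?=\s[a-z])")


def add_commas(text: str) -> str:
    # Only the dot itself is consumed; the lookahead merely inspects the
    # following space, so that space stays in place after the new comma.
    return pattern.sub(".,", text)


sample = "This is an example e.g. sentence with S.T.R. and K.LM.NO.P. as example. A.B.! stays."
print(add_commas(sample))
# This is an example e.g., sentence with S.T.R., and K.LM.NO.P., as example. A.B.! stays.
```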

RegEx Demo

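RegEx Details:

  • (?<=\.[a-zA-Z]): Assert that the dot being matched is preceded by a dot and a letter
  • \.: Match a dot
  • (?=\s[a-z]): Assert that the matched dot is followed by one whitespace character and a lowercase letter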
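
As far as I can tell, the pattern above only fires when the last segment of the abbreviation is a single letter and exactly one whitespace character follows, so it would leave "ab.cd." from the question's list untouched. A possible variant that follows the question's wording more literally is sketched below. It is an alternative under stated assumptions, not the answer's method: the name ABBREVIATION_BEFORE_TEXT and the choice to treat any non-word, non-space character as punctuation are my own.

```python
import re

ABBREVIATION_BEFORE_TEXT = re.compile(
    r"""
    \b(?:[A-Za-z]+\.){2,}          # two or more letters-plus-dot segments
    (?![A-Za-z]+\.)                # never stop in the middle of a longer abbreviation
    (?!\s*(?:[A-Z]|[^\w\s]|\Z))    # skip before a capital, punctuation, or end of text
    """,
    re.VERBOSE,
)


def add_commas(text: str) -> str:
    # Append a comma to each matched abbreviation; following whitespace is untouched.
    return ABBREVIATION_BEFORE_TEXT.sub(lambda m: m.group(0) + ",", text)


print(add_commas("see ab.cd. next and K.LM.NO.P. as example"))
# see ab.cd., next and K.LM.NO.P., as example
```

The first negative lookahead is what keeps backtracking from splitting "a.b.c.," into "a.b." plus "c.,", which is the failure seen with the pattern in the question.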