Python regex lookahead does not split correctly

Problem description

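My text consists of multiple sections. In each section:

  • The title is in uppercase and may span multiple lines
  • The body may contain acronyms, so we cannot assume that an uppercase word marks the start of a section

There may be zero or more newlines between sections.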

Example

import re

text = """
Lorem ipsum

THIS SECTION IS A SHORT STORY
1 Hello world
2 Bye bye
Side comment


NEXT SECTION SPANS 200
YEARS AND MANY COUNTRIES!

3 Joe Bloggs attended a NATO summit
4 John Doe heard...
THIS SECTION HAS NO
LINE BREAK / SPACE FROM
THE PREVIOUS ONE

5 Alice thought...
6 Bob visited...
""".strip()

re.split(r"\n(?=[^a-z]+\n+[a-z\d])", text)

I want it to split the text into sections like this:

["Lorem ipsum\n",
 "THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment\n\n",
 "NEXT SECTION SPANS 200\nYEARS AND MANY COUNTRIES!\n\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...",
 "THIS SECTION HAS NO\nLINE BREAK / SPACE FROM\nTHE PREVIOUS ONE\n\n5 Alice thought...\n6 Bob visited..."]

Instead, Python splits the text as follows, which seems to contradict the lookahead assertion:

["Lorem ipsum",
 "",
 "THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment",
 "",
 "",
 "NEXT SECTION SPANS 200",
 "YEARS AND MANY COUNTRIES!\n\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...",
 "THIS SECTION HAS NO",
 "LINE BREAK / SPACE FROM",
 "THE PREVIOUS ONE\n\n5 Alice thought...\n6 Bob visited..."]

Questions

Why does [^a-z]+ behave like a lazy match rather than a greedy one?

What is the correct solution?

python regex parsing split regex-lookarounds
Answer
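
A likely explanation for the behavior in the question, offered as a sketch rather than a definitive diagnosis: [^a-z] also matches \n, so the lookahead can run across several lines, and re.split tests it independently at every \n. The quantifier is still greedy; the lookahead simply succeeds at more positions than intended. The snippet below uses the pattern from the question with the lookahead body wrapped in a capture group (added here only for inspection) and a shortened, made-up sample, to show what the lookahead matched at each split point.

```python
import re

# Shortened, hypothetical sample: a two-line title followed by a body line.
sample = "intro\n\nTITLE LINE ONE\nTITLE LINE TWO\n\n1 body text"

# Pattern from the question; the extra group only exposes what the
# lookahead matched and does not change where the pattern matches.
pattern = re.compile(r"\n(?=([^a-z]+\n+[a-z\d]))")

# Collect (position of the matched newline, text seen by the lookahead).
hits = [(m.start(), m.group(1)) for m in pattern.finditer(sample)]
for pos, seen in hits:
    print(pos, repr(seen))
```

In this sample the lookahead succeeds at three newlines: the one after "intro", the blank line, and the one between the two title lines. Each of those positions is followed by non-lowercase text that eventually reaches a newline plus a digit, which is all the pattern asks for.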

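Updated example

We can add a lookbehind so that the split happens on the second \n of a double \n (or split on \n\n instead if you don't need the trailing \n), and include digits in the character set.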

re.split(r"(?<=\n)\n(?=[A-Z0-9 ]+\n)", text)

To require that the header line starts with an uppercase letter (after optional spaces), use:

(?<=\n)\n(?= *[A-Z][A-Z0-9 ]*\n)

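Output (the strings differ from the sample text in the question, which suggests it was produced from an earlier version of that text):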

['Lorem ipsum\n',
 'THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\n',
 'THIS SECTION SPANS 200\nYEARS AND MANY COUNTRIES\n3 Joe Bloggs saw...\n4 John Doe heard...\n',
 'THIS SECTION IS ALSO A\nLONG STORY ABOUT EVERYTHING\nSINCE 1669\n\n5 Alice thought...\n6 Bob visited...']
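
For comparison, here is a sketch that runs both lookbehind patterns against the sample text as it appears in the question (reproduced so the snippet is self-contained). Assuming the text is exactly as posted, the lookbehind requires a blank line before a header, so the section that follows the previous one without a blank line is not split off. The variable names parts and strict are only illustrative.

```python
import re

text = """
Lorem ipsum

THIS SECTION IS A SHORT STORY
1 Hello world
2 Bye bye
Side comment


NEXT SECTION SPANS 200
YEARS AND MANY COUNTRIES!

3 Joe Bloggs attended a NATO summit
4 John Doe heard...
THIS SECTION HAS NO
LINE BREAK / SPACE FROM
THE PREVIOUS ONE

5 Alice thought...
6 Bob visited...
""".strip()

# Split on the second newline of a blank line when an uppercase/digit line follows.
parts = re.split(r"(?<=\n)\n(?=[A-Z0-9 ]+\n)", text)

# Stricter variant: the following line must start with an uppercase letter.
strict = re.split(r"(?<=\n)\n(?= *[A-Z][A-Z0-9 ]*\n)", text)

for part in parts:
    print(repr(part))
```

On this input both patterns give three parts, and the last one still contains the "THIS SECTION HAS NO" header, because only a single \n separates it from the preceding body line.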


Using a loop

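import re

out = ['']
# start as True so a header on the first line goes into out[0]
prev_header = True
for line in text.splitlines():
    if line:  # blank lines are skipped and do not appear in the output
        # a header line contains no lowercase letters
        header = bool(re.fullmatch('[^a-z]+', line))
        if header and not prev_header:
            # first header line after body text: start a new section
            out.append(line+'\n')
        else:
            # body line, or continuation of a multi-line header
            out[-1] += line+'\n'
        prev_header = header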

Output:

['Lorem ipsum\n',
 'THIS SECTION IS A SHORT STORY\n1 Hello world\n2 Bye bye\nSide comment\n',
 'NEXT SECTION SPANS 200\nYEARS AND MANY COUNTRIES!\n3 Joe Bloggs attended a NATO summit\n4 John Doe heard...\n',
 'THIS SECTION HAS NO\nLINE BREAK / SPACE FROM\nTHE PREVIOUS ONE\n5 Alice thought...\n6 Bob visited...\n']
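
The loop drops blank lines. If they should be kept, as in the expected output in the question, one possible variant is sketched below. The function name split_sections and the HEADER constant are hypothetical, and the sketch assumes blank lines belong to the section that precedes them, so it does not reproduce the question's expected list exactly. Unlike the loop above, it also treats whitespace-only lines as blank.

```python
import re

# A header line contains no lowercase letters (same test as the loop above).
HEADER = re.compile(r"[^a-z\n]+")

def split_sections(text):
    """Group lines into sections, starting a new one at each run of header lines."""
    sections = [""]
    prev_header = True  # avoids an empty first section if the text starts with a header
    for raw in text.splitlines(keepends=True):
        line = raw.rstrip("\r\n")
        if line.strip():  # blank lines do not affect header tracking
            header = HEADER.fullmatch(line) is not None
            if header and not prev_header:
                sections.append("")  # first header line after body text
            prev_header = header
        sections[-1] += raw  # keep the original line ending, blank lines included
    return sections

# Usage on a shortened, made-up sample.
sample = "intro\n\nTITLE ONE\nTITLE TWO\n\n1 body\nNEXT TITLE\n2 more"
result = split_sections(sample)
for section in result:
    print(repr(section))
```

Because every line is appended unchanged, joining the returned sections gives back the input text, which makes the split easy to verify.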