How do I select an entire block of text after finding specific text inside it?


For example, I want to check whether the person named `JEMIMA` has `DIVISION 4`.

I am running this regex:

JEMIMA((?:(?:(?!JEMIMA|DIVISION 4).|[\r\n])*?)DIVISION 4)

against this text:

34
DIVISION 0
CIV-'F' HIST-'F' GEO-'F' KISW-'D' ENGL-'F' PHY-'F' CHEM-'F' BIO-'F' B/MATH-'F' 


S1147/0173
20150987402
JEMIMA SILVESTER SANGAWE
F
28
DIVISION 4
CIV-'D' HIST-'F' GEO-'D' KISW-'C' ENGL-'C' PHY-'F' CHEM-'F' BIO-'D' B/MATH-'F' 

S1148/0173
20150987403

But it only selects:

JEMIMA SILVESTER SANGAWE
F
28
DIVISION 4

**I want to select the whole block:**

S1147/0173
20150987402
JEMIMA SILVESTER SANGAWE
F
28
DIVISION 4
CIV-'D' HIST-'F' GEO-'D' KISW-'C' ENGL-'C' PHY-'F' CHEM-'F' BIO-'D' B/MATH-'F' 

Please help me extract the whole paragraph once the text has been found.

python regex notepad++ sublimetext3
1 answer

Try the following regex.

^(?=(?:(?!\n\n)[\S\s])*?\bJEMIMA\b)(?=(?:(?!\n\n)[\S\s])*?\bDIVISION\ 4\b).+(?:\n.+)*

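This regex checks that both `JEMIMA` and `DIVISION 4` occur, in either order, within the same blank-line-delimited block, and then matches that whole block. It breaks down as follows.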

^             the beginning of a line (multiline mode)
---------------------------------------------------------
(?=           look ahead to see if there is:
---------------------------------------------------------
  (?:           group, but do not capture (0 or more
                times (matching the least amount
                possible)):
---------------------------------------------------------
    (?!           look ahead to see if there is not:
---------------------------------------------------------
      \n            '\n' (newline)
---------------------------------------------------------
      \n            '\n' (newline)
---------------------------------------------------------
    )             end of look-ahead
---------------------------------------------------------
    [\S\s]        any character of: non-whitespace (all
                  but \n, \r, \t, \f, and " "),
                  whitespace (\n, \r, \t, \f, and " ")
---------------------------------------------------------
  )*?           end of grouping
---------------------------------------------------------
  \b            the boundary between a word char (\w)
                and something that is not a word char
---------------------------------------------------------
  JEMIMA        'JEMIMA'
---------------------------------------------------------
  \b            the boundary between a word char (\w)
                and something that is not a word char
---------------------------------------------------------
)             end of look-ahead
---------------------------------------------------------
(?=           look ahead to see if there is:
---------------------------------------------------------
  (?:           group, but do not capture (0 or more
                times (matching the least amount
                possible)):
---------------------------------------------------------
    (?!           look ahead to see if there is not:
---------------------------------------------------------
      \n            '\n' (newline)
---------------------------------------------------------
      \n            '\n' (newline)
---------------------------------------------------------
    )             end of look-ahead
---------------------------------------------------------
    [\S\s]        any character of: non-whitespace (all
                  but \n, \r, \t, \f, and " "),
                  whitespace (\n, \r, \t, \f, and " ")
---------------------------------------------------------
  )*?           end of grouping
---------------------------------------------------------
  \b            the boundary between a word char (\w)
                and something that is not a word char
---------------------------------------------------------
  DIVISION\ 4   'DIVISION 4' (the space is escaped)
---------------------------------------------------------
  \b            the boundary between a word char (\w)
                and something that is not a word char
---------------------------------------------------------
)             end of look-ahead
---------------------------------------------------------
.+            any character except \n (1 or more times
              (matching the most amount possible))
---------------------------------------------------------
(?:           group, but do not capture (0 or more times
              (matching the most amount possible)):
---------------------------------------------------------
  \n            '\n' (newline)
---------------------------------------------------------
  .+            any character except \n (1 or more times
                (matching the most amount possible))
---------------------------------------------------------
)*            end of grouping
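
The part of the breakdown that keeps each lookahead inside one block is the `(?!\n\n)` check placed in front of every consumed character. A minimal sketch of that effect, using a made-up two-block sample (the `sample` string and the variable names are illustrative, not from the question):

```python
import re

# Illustrative sample: the name is in the first block, the division in the second.
sample = "JEMIMA\nDIVISION 0\n\nSOMEONE ELSE\nDIVISION 4"

# Tempered scan: (?!\n\n) refuses to step onto a newline that starts a blank line.
tempered = re.compile(r"(?=(?:(?!\n\n)[\S\s])*?\bDIVISION\ 4\b)")
# Plain scan: [\S\s]*? is free to run across blank lines.
unrestricted = re.compile(r"(?=[\S\s]*?\bDIVISION\ 4\b)")

# Both patterns are tried at position 0, i.e. at the start of the first block.
stays_in_block = tempered.match(sample) is None
crosses_blank_line = unrestricted.match(sample) is not None

print(stays_in_block, crosses_blank_line)
```

From the start of the first block, the tempered version fails because `DIVISION 4` only appears after the blank line, while the unrestricted version succeeds by reading into the next block.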

A sample Python program using this regex.

import re

test_str = """
34
DIVISION 0
CIV-'F' HIST-'F' GEO-'F' KISW-'D' ENGL-'F' PHY-'F' CHEM-'F' BIO-'F' B/MATH-'F'

34
DIVISION 4
CIV-'F' HIST-'F' GEO-'F' KISW-'D' ENGL-'F' PHY-'F' CHEM-'F' BIO-'F' B/MATH-'F'
JEMIMA

S1147/0173
20150987402
JEMIMA SILVESTER SANGAWE
F
28
DIVISION 4
CIV-'D' HIST-'F' GEO-'D' KISW-'C' ENGL-'C' PHY-'F' CHEM-'F' BIO-'D' B/MATH-'F' 

S1148/0173
20150987403

S1148/
0173/
JEMIMA/
20150987403 (DIVISION 4)
"""

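# Each lookahead scans only up to the next blank line (\n\n), so JEMIMA and
# DIVISION 4 must both sit in the same block; .+(?:\n.+)* then consumes the
# block line by line until the first empty line.
r = r"^(?=(?:(?!\n\n)[\S\s])*?\bJEMIMA\b)(?=(?:(?!\n\n)[\S\s])*?\bDIVISION\ 4\b).+(?:\n.+)*"
# MULTILINE lets ^ match at the start of every line, not only at the start of the string.
rx = re.compile(r, re.MULTILINE)
print('\n\n'.join(m.group() for m in rx.finditer(test_str)))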

Output:

34
DIVISION 4
CIV-'F' HIST-'F' GEO-'F' KISW-'D' ENGL-'F' PHY-'F' CHEM-'F' BIO-'F' B/MATH-'F'
JEMIMA

S1147/0173
20150987402
JEMIMA SILVESTER SANGAWE
F
28
DIVISION 4
CIV-'D' HIST-'F' GEO-'D' KISW-'C' ENGL-'C' PHY-'F' CHEM-'F' BIO-'D' B/MATH-'F' 

S1148/
0173/
JEMIMA/
20150987403 (DIVISION 4)
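
If the work is done in Python rather than in an editor's search box, the same result can probably be reached without lookaheads by splitting the text on blank lines and filtering the blocks. This is an alternative sketch, not the answer's method; `blocks_with_all` is a hypothetical helper name, and the sample is a shortened version of the question's data.

```python
import re


def blocks_with_all(text, *needles):
    """Return the blank-line-separated blocks that contain every needle as a whole word."""
    # Split on one or more blank (or whitespace-only) lines.
    blocks = [b.strip("\n") for b in re.split(r"\n\s*\n", text)]
    patterns = [re.compile(r"\b" + re.escape(n) + r"\b") for n in needles]
    return [b for b in blocks if b and all(p.search(b) for p in patterns)]


# Shortened version of the question's data.
sample = (
    "34\nDIVISION 0\n"
    "\n"
    "S1147/0173\n20150987402\nJEMIMA SILVESTER SANGAWE\nF\n28\nDIVISION 4\n"
    "\n"
    "S1148/0173\n20150987403\n"
)

result = blocks_with_all(sample, "JEMIMA", "DIVISION 4")
print("\n\n".join(result))
```

The split-and-filter form is easier to extend with more search terms, but it only works where code can be run; the single regex above is what a find dialog would need.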

Should `DIVISION 4` come after `JEMIMA`? Try ``.
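
The pattern suggested in the comment above is missing from the page. As a rough sketch of one way to require that order (an assumption, not necessarily what the commenter had in mind), the two lookaheads can be merged into one so that `DIVISION 4` has to follow `JEMIMA` inside the same block. The sample is a shortened version of the answer's test data.

```python
import re

# One lookahead: JEMIMA first, then DIVISION 4, with no blank line in between.
ordered = re.compile(
    r"^(?=(?:(?!\n\n)[\S\s])*?\bJEMIMA\b(?:(?!\n\n)[\S\s])*?\bDIVISION\ 4\b)"
    r".+(?:\n.+)*",
    re.MULTILINE,
)

# First block has DIVISION 4 before JEMIMA, second block has it after.
sample = (
    "34\nDIVISION 4\nJEMIMA\n"
    "\n"
    "S1147/0173\nJEMIMA SILVESTER SANGAWE\nDIVISION 4\n"
)

matches = [m.group() for m in ordered.finditer(sample)]
print("\n\n".join(matches))
```

Only the second block is returned, because in the first one the name comes after the division.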
