Regex wildcard matching when multiple matches are possible


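I wrote code that parses a structured document (meaning every paragraph has a heading, e.g. 1.1., 1.1.1., 2.1.) and returns a dictionary mapping each paragraph's reference to its text. I ran into a problem where more than one match is possible and the function returns the larger of the two, which is not the one I am looking for.

For example: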

text = """
    1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
    ....
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. """

The call I use to search for this is:

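# re.DOTALL lets "." cross newlines; group(1) is the captured text between the two headings
section = re.match(r"\s+1\.1\.\s+(.*)\s+1\.2\.\s+", text, re.DOTALL).group(1)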

What I want to get back:

section = "Hello World!"

What I actually get:

section = """Hello World!
    1.2. This is the second section and it's a beautiful day.
    ....
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph """

I tried .group(2), hoping to select the match I wanted, but only one match object exists.

How can I get the smaller match?

python regex match regex-group
1 Answer

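This task is not as simple as it looks, and how hard it is depends on your edge cases.

As the comments pointed out, you can try a lazy quantifier.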
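Applied directly to the pattern from the question, the change is one character: `(.*)` becomes `(.*?)`. The following is a minimal sketch, assuming the original call used `re.DOTALL` and `.group(1)` (the output shown in the question suggests both, but the question does not say so).

```python
import re

text = """
    1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
    ....
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. """

# Greedy: (.*) grabs as much as possible, so it runs up to the LAST "1.2." in the text.
greedy = re.match(r"\s+1\.1\.\s+(.*)\s+1\.2\.\s+", text, re.DOTALL).group(1)

# Lazy: (.*?) grabs as little as possible, so it stops before the FIRST "1.2.".
lazy = re.match(r"\s+1\.1\.\s+(.*?)\s+1\.2\.\s+", text, re.DOTALL).group(1)

print(repr(lazy))  # 'Hello World!'
```

Both calls return a single match object. The quantifier only changes how far that one match extends, which is why `.group(2)` had nothing to select.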

Code that splits the whole text into numbered paragraphs:

import re

s = """
1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
    ....
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. 
1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
            1.2.1. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
    
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. 
    1.2. This is the second section and it's a beautiful day.
            1.2.1. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.6. This is the second section and it's a beautiful day.
                        1.2.3.4.6.7. This is the second section and it's a beautiful day.
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
    1.2. This is the second section and it's a beautiful day.
            1.2.1. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
            1.2.3. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.6. This is the second section and it's a beautiful day.
                        1.2.3.4.6.7. This is the second section and it's a beautiful day.

    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. 
    1.2. This is the second section and it's a beautiful day.
            1.2.1. This is the second section and it's a beautiful day.
            1.2.3. This is the second section 8, 9, 98784567 and it's a beautiful day.
            1.2.3. This is the second section $4.48 and it's a beautiful day.
            1.2.3. This is the second section 6536.32 .2.2.2.1. and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second section and it's a beautiful day.
                 1.2.3.4.5. This is the second 18419347b1934 817 7837 section and it's a beautiful day.
                 1.2.3.4.6. This is the second section and it's a beautiful day.
                        1.2.3.4.6.7. This is the second section and it's a beautiful day.


"""

p = r"(?:[0-9]+\.)+\s+[\w\W]+?(?=(?:[0-9]+\.){2,}|$)"

find_pars = re.findall(p, s)

for i, par in enumerate(find_pars):
    print(f"Found at index {i}->: {par}")


Output:

Found at index 0->: 1.1. Hello World!
    
Found at index 1->: 1.2. This is the second section and it's a beautiful day.
    ....
    
Found at index 2->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 
Found at index 3->: 1.2. 
1.1. Hello World!
    
Found at index 4->: 1.2. This is the second section and it's a beautiful day.
            
Found at index 5->: 1.2.1. This is the second section and it's a beautiful day.
            
Found at index 6->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 7->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 8->: 1.2.3. This is the second section and it's a beautiful day.
    
    
Found at index 9->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 
Found at index 10->: 1.2. 
    1.2. This is the second section and it's a beautiful day.
            
Found at index 11->: 1.2.1. This is the second section and it's a beautiful day.
            
Found at index 12->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 13->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 14->: 1.2.3. This is the second section and it's a beautiful day.
                 
Found at index 15->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 16->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 17->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 18->: 1.2.3.4.6. This is the second section and it's a beautiful day.
                        
Found at index 19->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.
    
Found at index 20->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 
Found at index 21->: 1.2.
    1.2. This is the second section and it's a beautiful day.
            
Found at index 22->: 1.2.1. This is the second section and it's a beautiful day.
            
Found at index 23->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 24->: 1.2.3. This is the second section and it's a beautiful day.
            
Found at index 25->: 1.2.3. This is the second section and it's a beautiful day.
                 
Found at index 26->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 27->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 28->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 29->: 1.2.3.4.6. This is the second section and it's a beautiful day.
                        
Found at index 30->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.

    
Found at index 31->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 
Found at index 32->: 1.2. 
    1.2. This is the second section and it's a beautiful day.
            
Found at index 33->: 1.2.1. This is the second section and it's a beautiful day.
            
Found at index 34->: 1.2.3. This is the second section 8, 9, 98784567 and it's a beautiful day.
            
Found at index 35->: 1.2.3. This is the second section $4.48 and it's a beautiful day.
            
Found at index 36->: 1.2.3. This is the second section 6536.32 .
Found at index 37->: 2.2.2.1. and it's a beautiful day.
                 
Found at index 38->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 39->: 1.2.3.4.5. This is the second section and it's a beautiful day.
                 
Found at index 40->: 1.2.3.4.5. This is the second 18419347b1934 817 7837 section and it's a beautiful day.
                 
Found at index 41->: 1.2.3.4.6. This is the second section and it's a beautiful day.
                        
Found at index 42->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.
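To make the pattern `p` easier to read, here is the same expression spelled out with `re.VERBOSE` and one comment per piece. This is only an annotated restatement of the pattern used above; the name `p_verbose` and the short sample text are introduced here for illustration.

```python
import re

p = r"(?:[0-9]+\.)+\s+[\w\W]+?(?=(?:[0-9]+\.){2,}|$)"

p_verbose = re.compile(r"""
    (?:[0-9]+\.)+          # the reference: one or more "digits then dot" groups, e.g. 1.2.3.
    \s+                    # whitespace between the reference and its text
    [\w\W]+?               # the paragraph text, matched lazily; [\w\W] also matches newlines
    (?=                    # lookahead: stop just before, without consuming, either
        (?:[0-9]+\.){2,}   #   the next reference with at least two numeric levels
      | $                  #   or the end of the string
    )
""", re.VERBOSE)

sample = """
    1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
"""

# Both spellings are the same pattern, so they find the same paragraphs.
print(p_verbose.findall(sample) == re.findall(p, sample))
```

The lookahead is what splits `paragraph 1.2.` and `.2.2.2.1.` in the output above: it accepts any run of two or more "digits then dot" groups as the start of the next paragraph, wherever in a line that run appears.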


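Notes:

  • You usually need to walk through the input string.
  • Design a "parser" algorithm that combines regular expressions with other string-matching methods to parse the data.

As mentioned above, the pattern fails for several edge cases (for example, a reference with a missing trailing dot, such as "1.2.3").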
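One way to act on these notes is to walk the input line by line and treat a reference as a heading only when it starts a line. The sketch below is not part of the original answer: `parse_sections` and `HEADING` are hypothetical names, and it assumes every heading sits at the start of its own line and ends with a dot, so the missing-dot case still fails.

```python
import re

# Assumption: a heading is a dotted reference at the START of a line, e.g. "1.2.3."
HEADING = re.compile(r"^\s*((?:\d+\.)+)\s+(.*)$")

def parse_sections(text):
    """Return {reference: paragraph text}, scanning the input line by line."""
    sections = {}
    current = None
    for line in text.splitlines():
        m = HEADING.match(line)
        if m:
            # A new heading opens a new paragraph.
            current = m.group(1)
            sections[current] = m.group(2).strip()
        elif current is not None and line.strip():
            # Any other non-empty line continues the current paragraph.
            sections[current] += " " + line.strip()
    return sections

text = """
    1.1. Hello World!
    1.2. This is the second section and it's a beautiful day.
    ....
    8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. """

sections = parse_sections(text)
print(sections["1.1."])
```

Because only line-initial references count as headings, the inline mention `paragraph 1.2.` stays inside the text of section 8.1. instead of starting a new match. A repeated reference overwrites the earlier entry in the dictionary, so documents that reuse numbers would need a different container.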
