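I wrote code that parses a structured document (every paragraph has a heading such as 1.1., 1.1.1., or 2.1.) and returns a dictionary that maps each paragraph's reference to its text. I've run into a problem when more than one match is possible: the function returns the larger of the two matches, which is not the one I'm looking for.
For example: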
text = """
1.1. Hello World!
1.2. This is the second section and it's a beautiful day.
....
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. """
The call I use to search for this is:
section = re.match(r"\s+1\.1\.\s+(.*)\s+1\.2\.\s+", text, re.DOTALL).group(1)  # re.DOTALL assumed: the multi-line result below requires it
What I want to get back:
section = "Hello World!"
What I actually get:
section = """Hello World!
1.2. This is the second section and it's a beautiful day.
....
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph """
I tried .group(2), hoping to select the match I wanted, but only one match object exists.
How do I get the smaller match?
This task is not as simple as it looks; how hard it gets depends on your edge cases.
As the comments suggest, you can try a lazy quantifier.
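Applied to the pattern from the question, the lazy quantifier could look like the sketch below. It assumes the original call used re.DOTALL (the multi-line result in the question suggests so) and reads capture group 1; the variable names greedy and lazy are only for illustration.

```python
import re

text = """
1.1. Hello World!
1.2. This is the second section and it's a beautiful day.
....
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2. """

# Greedy: (.*) first consumes everything, then backtracks to the LAST "1.2."
greedy = re.match(r"\s+1\.1\.\s+(.*)\s+1\.2\.\s+", text, re.DOTALL).group(1)

# Lazy: (.*?) grows one character at a time and stops at the FIRST "1.2."
lazy = re.match(r"\s+1\.1\.\s+(.*?)\s+1\.2\.\s+", text, re.DOTALL).group(1)

print(repr(lazy))
```

Note that .group(2) cannot help here: the pattern has a single capture group, and greedy versus lazy only changes which single match the engine settles on.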
import re
s = """
1.1. Hello World!
1.2. This is the second section and it's a beautiful day.
....
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
1.1. Hello World!
1.2. This is the second section and it's a beautiful day.
1.2.1. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
1.2. This is the second section and it's a beautiful day.
1.2.1. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.6. This is the second section and it's a beautiful day.
1.2.3.4.6.7. This is the second section and it's a beautiful day.
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
1.2. This is the second section and it's a beautiful day.
1.2.1. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.6. This is the second section and it's a beautiful day.
1.2.3.4.6.7. This is the second section and it's a beautiful day.
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
1.2. This is the second section and it's a beautiful day.
1.2.1. This is the second section and it's a beautiful day.
1.2.3. This is the second section 8, 9, 98784567 and it's a beautiful day.
1.2.3. This is the second section $4.48 and it's a beautiful day.
1.2.3. This is the second section 6536.32 .2.2.2.1. and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second section and it's a beautiful day.
1.2.3.4.5. This is the second 18419347b1934 817 7837 section and it's a beautiful day.
1.2.3.4.6. This is the second section and it's a beautiful day.
1.2.3.4.6.7. This is the second section and it's a beautiful day.
"""
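# A heading (one or more "digits." groups), whitespace, then lazily any text
# up to the next run of two or more "digits." groups, or the end of the string.
p = r"(?:[0-9]+\.)+\s+[\w\W]+?(?=(?:[0-9]+\.){2,}|$)"
find_pars = re.findall(p, s)
for i, par in enumerate(find_pars):
    print(f"Found at index {i}->: {par}")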
Found at index 0->: 1.1. Hello World!
Found at index 1->: 1.2. This is the second section and it's a beautiful day.
....
Found at index 2->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph
Found at index 3->: 1.2.
1.1. Hello World!
Found at index 4->: 1.2. This is the second section and it's a beautiful day.
Found at index 5->: 1.2.1. This is the second section and it's a beautiful day.
Found at index 6->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 7->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 8->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 9->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph
Found at index 10->: 1.2.
1.2. This is the second section and it's a beautiful day.
Found at index 11->: 1.2.1. This is the second section and it's a beautiful day.
Found at index 12->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 13->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 14->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 15->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 16->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 17->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 18->: 1.2.3.4.6. This is the second section and it's a beautiful day.
Found at index 19->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.
Found at index 20->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph
Found at index 21->: 1.2.
1.2. This is the second section and it's a beautiful day.
Found at index 22->: 1.2.1. This is the second section and it's a beautiful day.
Found at index 23->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 24->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 25->: 1.2.3. This is the second section and it's a beautiful day.
Found at index 26->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 27->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 28->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 29->: 1.2.3.4.6. This is the second section and it's a beautiful day.
Found at index 30->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.
Found at index 31->: 8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph
Found at index 32->: 1.2.
1.2. This is the second section and it's a beautiful day.
Found at index 33->: 1.2.1. This is the second section and it's a beautiful day.
Found at index 34->: 1.2.3. This is the second section 8, 9, 98784567 and it's a beautiful day.
Found at index 35->: 1.2.3. This is the second section $4.48 and it's a beautiful day.
Found at index 36->: 1.2.3. This is the second section 6536.32 .
Found at index 37->: 2.2.2.1. and it's a beautiful day.
Found at index 38->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 39->: 1.2.3.4.5. This is the second section and it's a beautiful day.
Found at index 40->: 1.2.3.4.5. This is the second 18419347b1934 817 7837 section and it's a beautiful day.
Found at index 41->: 1.2.3.4.6. This is the second section and it's a beautiful day.
Found at index 42->: 1.2.3.4.6.7. This is the second section and it's a beautiful day.
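As noted, this fails for several edge cases, for example a heading with a missing trailing dot such as "1.2.3". The output above also shows that an inline reference such as "paragraph 1.2." is split off as its own match (indices 2 and 3), as is the stray ".2.2.2.1." in the middle of a sentence (indices 36 and 37).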
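If every heading in your document starts at the beginning of a line (an assumption, not something the question states), one way to sidestep these edge cases is to anchor the heading with re.MULTILINE and slice the text between consecutive headings. This is a rough sketch, not a complete parser; HEADING and split_sections are hypothetical names.

```python
import re

# Assumption: a heading only ever appears at the start of a line.
# It is one or more "digits." groups, optionally followed by bare digits
# (so "1.2.3" without the trailing dot is accepted), then spaces or tabs.
HEADING = re.compile(r"^[ \t]*((?:\d+\.)+\d*)[ \t]+", re.MULTILINE)

def split_sections(text):
    """Return (reference, body) pairs, one per line-initial heading."""
    matches = list(HEADING.finditer(text))
    sections = []
    # Pair each heading with the next one; the last heading runs to the end.
    for cur, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(text)
        sections.append((cur.group(1), text[cur.end():end].strip()))
    return sections

sample = """
1.1. Hello World!
1.2. This is the second section and it's a beautiful day.
....
8.1. This is the eighth section and it's a beautiful day as previously stated in paragraph 1.2.
1.2.3 This is the second section 6536.32 .2.2.2.1. and it's a beautiful day.
"""

result = split_sections(sample)
for ref, body in result:
    print(f"{ref} -> {body!r}")
```

When the references are unique, dict(split_sections(text)) gives the reference-to-text dictionary from the question. The sketch still misfires if a body line happens to begin with something like "6536.32 ", so check it against your real documents.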