Regex pattern for different date formats does not seem to capture all the desired cases

Question (0 votes, 2 answers)

I am trying to process a txt file line by line to find date information written in different patterns and rewrite it in a consistent YYYY, YYYY-MM, or YYYY-MM-DD format.

The input formats, which appear inside narrative text on different lines, are:

1) YYYY, e.g. 1890
2) MM.YYYY, e.g. 10.1765
3) M.YYYY, e.g. 9.1700
4) DD.MM.YYYY, e.g. 11.11.1876
5) D.MM.YYYY, e.g. 9.10.1678
6) D.M.YYYY, e.g. 9.1.1768
7) DD.M.YYYY, e.g. 21.3.1789
8) DD.MM., e.g. 12.12. (no year)
9) D.M., e.g. 1.1. (no year)

In context, the relevant lines typically look like this:

"7.1.1695 jur. geprüft"
"12.1. verteidigt unter v. Haaren"
"ord. [Weihe] Mainz 21.9.1743"
"erhielt 1786 die Pfarrei Irmstraut (Diöz. Trier)"
"ein Anton Alperstätt 20.9. 1748 bacc."

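I have been trying to write a single regex pattern that captures as many of these cases as possible and replaces each recognized string with my desired output format (extracted from a larger script):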

    def replace_dates(merged_lines):
        def format_date(search):
            year = ""
            month = ""
            day = ""
            # check how many groups are in the pattern
            num_groups = search.groups()
            print(num_groups)
            if len(num_groups) == 0:
                year = "0000"
            if len(num_groups) == 1:
                year = search.group(0)
                print(year)
            elif 2 < len(num_groups) <= 4:
                day = search.group(1) if search.group(1) else '00'
                month = search.group(3) if search.group(3) else '00'
            elif num_groups == 5:
                day = search.group(1) if search.group(1) else '00'
                month = search.group(3) if search.group(3) else '00'
                year = search.group(5)

            # determine the output format
            if month != '00' and day != '00':
                return f"{year}-{month}-{day}"  # Format: YYYY-MM-DD
            elif month != '00':
                return f"{year}-{month}"         # Format: YYYY-MM
            else:
                return year                      # Format: YYYY

        # Compile regex pattern
        date_pattern = re.compile(r'(?<!\d)(\d{1,2})([.-]|\s)?(\d{1,2})?([.-]|\s)?(\d{4})(?!\d)')

        #Group 1: day (1 or 2 digits)
        #Group 2: separator between day and month (if present)
        #Group 3: month (1 or 2 digits)
        #Group 4: separator between month and year (if present)
        #Group 5: year (4 digits)

        replaced_lines = []

        for line in merged_lines:
            if line.startswith("[Source]"):
                replaced_lines.append(line)
            else:
                searches = date_pattern.finditer(line)
                for search in searches:
                    line = line.replace(search.group(0), format_date(search))
                replaced_lines.append(line)

        return replaced_lines

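However, looking at the matches that are actually captured, I have a couple of problems:

  1. Instances where only a year is present are somehow ignored.
  2. My data contains time ranges such as "1786-1790", which are retrieved as ('17', None, '86', '-', '1790').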
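Both symptoms can be reproduced in isolation. The following is a minimal sketch that reuses the pattern from the code above on strings taken from this question; the explanations in the comments are one reading of the pattern, not a confirmed diagnosis of the full script.

```python
import re

date_pattern = re.compile(
    r'(?<!\d)(\d{1,2})([.-]|\s)?(\d{1,2})?([.-]|\s)?(\d{4})(?!\d)'
)

# Group 1 (day) is mandatory and group 5 (year) needs four more digits,
# so a match needs at least five digits: a bare year can never match.
year_only = date_pattern.search("erhielt 1786 die Pfarrei Irmstraut")
print(year_only)

# In a range, the first year is split into "day" and "month", and the
# hyphen is accepted as the separator before the second year.
year_range = date_pattern.search("1786-1790")
print(year_range.groups())

# groups() returns one entry per capturing group (None for groups that did
# not participate), so its length is 5 for every match of this pattern.
full_date = date_pattern.search("7.1.1695 jur. geprüft")
print(len(full_date.groups()), len(year_range.groups()))
```

Because the length of `groups()` never varies, branching on it inside `format_date` cannot distinguish the cases; testing which individual groups are `None` is likely closer to what was intended.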

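If a line contains several dates, I am happy to process only the first one to keep things simple. I have also considered giving up on a single regex and handling each case separately, but I worry that this would make the script unnecessarily complicated.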

Tags: python, regex, match

2 Answers

Answer 1 (0 votes)

Given the examples provided, plus a few additional ones:

import re
from datetime import datetime

lpatt = r"^(.+ )?([0-9]{1,2}\.)([0-9]{1,2}\.){0,2}( ?[0-9]{4})?( .+)?$"

cpatt = re.compile(lpatt)

lines = [
    "asd 1.12.1698 dssd asdasd",
    "fdsfd 3.1254 cdcd sss",
    "some 31.12.1600 next",
    "some 7.1.1492 next",
    "some 31.12. next",
    "some next",
    "7.1.1695 jur. geprüft",
    "12.1. verteidigt unter v. Haaren",
    "ord. [Weihe] Mainz 21.9.1743",
    "erhielt 1786 die Pfarrei Irmstraut (Diöz. Trier)",
    "ein Anton Alperstätt 20.9. 1748 bacc."
]

for current in lines:
    result = cpatt.findall(current)
    if len(result) == 0:
        continue
    
    dparts = [ p for p in result[0][1:4] if p != '' ]
    #print(dparts)
    if len(dparts) == 2 and len(dparts[1]) < 4:
        dfmt = datetime.strptime(f"{dparts[0]}{dparts[1]}", "%d.%m.").strftime("%m-%d")
        print(f"{result[0][0]}{dfmt}{result[0][4]}")
    elif len(dparts) == 2 and len(dparts[1]) == 4:
        dfmt = datetime.strptime(f"{dparts[0]}{dparts[1]}", "%m.%Y").strftime("%Y-%m")
        print(f"{result[0][0]}{dfmt}{result[0][4]}")
    elif len(dparts) == 3:
        year = dparts[2].replace(' ', '')
        dfmt = datetime.strptime(f"{dparts[0]}{dparts[1]}{year}", "%d.%m.%Y").strftime("%Y-%m-%d")
        print(f"{result[0][0]}{dfmt}{result[0][4]}")
    else:
        print(current)

Result:

asd 1698-12-01 dssd asdasd
fdsfd 1254-03 cdcd sss
some 1600-12-31 next
some 1492-01-07 next
some 12-31 next
1695-01-07 jur. geprüft
01-12 verteidigt unter v. Haaren
ord. [Weihe] Mainz 1743-09-21
ein Anton Alperstätt 1748-09-20 bacc.
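To make the indexing `result[0][1:4]` easier to follow, here is a small sketch of what `findall` returns for this pattern. It uses the same regex as above; the variable names `with_date` and `year_only` are only illustrative.

```python
import re

cpatt = re.compile(
    r"^(.+ )?([0-9]{1,2}\.)([0-9]{1,2}\.){0,2}( ?[0-9]{4})?( .+)?$"
)

# One tuple per match: (prefix, first number, second number, year, suffix).
# Groups that did not participate show up as empty strings.
with_date = cpatt.findall("some 31.12.1600 next")
print(with_date)

# The pattern requires "one or two digits followed by a dot" somewhere,
# so a line that only contains a bare year produces no match at all.
year_only = cpatt.findall("erhielt 1786 die Pfarrei Irmstraut (Diöz. Trier)")
print(year_only)
```

This appears to be why the "some next" line and the "erhielt 1786 ..." line are missing from the result: the loop skips unmatched lines with `continue`. Printing `current` before that `continue` would pass them through unchanged.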

Answer 2 (0 votes)

Based on your sample data, you could use this regex to match your dates:

(?<!\d)(?=\d)(?:(?P<day>\d\d?)\.)?(?:(?P<month>\d\d?)\.)?(?P<year>\d{4})?(?!\d)

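This matches:

  • (?<!\d)(?=\d) : a negative lookbehind for a digit and a positive lookahead for a digit; together these ensure the match starts at the beginning of a run of digits
  • (?:(?P<day>\d\d?)\.)? : an optional day number, followed by .
  • (?:(?P<month>\d\d?)\.)? : an optional month number, followed by .
  • (?P<year>\d{4})? : an optional year
  • (?!\d) : a negative lookahead for a digit; combined with the positive lookahead at the start of the regex, this ensures that at least one of the optional groups must match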
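As a quick check of the breakdown above, this sketch prints the named groups for a few of the input formats listed in the question. The `samples` list and the `captured` dictionary are illustrative names, not part of the original answer.

```python
import re

pattern = re.compile(
    r'(?<!\d)(?=\d)(?:(?P<day>\d\d?)\.)?(?:(?P<month>\d\d?)\.)?'
    r'(?P<year>\d{4})?(?!\d)'
)

samples = ["7.1.1695", "12.1.", "1786", "10.1765"]

# Map each sample to the dictionary of named groups of its first match
captured = {text: pattern.search(text).groupdict() for text in samples}
for text, groups in captured.items():
    print(text, groups)
```

One thing worth noticing: for the MM.YYYY form ("10.1765") the number before the dot lands in the `day` group and `month` stays `None`. That is harmless as long as the values are simply joined year-first, but it matters if you need to treat day and month differently.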

Regex demo on regex101

Having captured the groups, you can then reformat the values into YYYY-MM-DD format:

import re

lines = [
    "7.1.1695 jur. geprüft",
    "12.1. verteidigt unter v. Haaren",
    "ord. [Weihe] Mainz 21.9.1743",
    "erhielt 1786 die Pfarrei Irmstraut (Diöz. Trier)",
    "ein Anton Alperstätt 20.9. 1748 bacc."
]

def formatter(m):
    formats = { 'year' : 4, 'month' : 2, 'day' : 2 }
    parts = [ f'{m.group(g).zfill(formats[g])}' for g in formats if m.group(g) ]
    return '-'.join(parts)

pattern = re.compile(r'(?<!\d)(?=\d)(?:(?P<day>\d\d?)\.)?(?:(?P<month>\d\d?)\.)?(?P<year>\d{4})?(?!\d)')

for line in lines:
    print(pattern.sub(formatter, line))

Output:

1695-01-07 jur. geprüft
01-12 verteidigt unter v. Haaren
ord. [Weihe] Mainz 1743-09-21
erhielt 1786 die Pfarrei Irmstraut (Diöz. Trier)
ein Anton Alperstätt 09-20 1748 bacc.
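In the last line, "20.9. 1748" is treated as two separate matches because of the space before the year. If you want such cases merged, one possible variant is sketched below. It assumes that a four-digit number separated from "D.M." by a single whitespace character really is the year of that date, which may not hold for all of your data. The names `pattern_ws`, `normalize_dates` and `first_only` are hypothetical and only used for illustration.

```python
import re

# Assumption: "D.M. YYYY" (one whitespace character before the year)
# denotes a single date. The whitespace is only consumed if a year follows.
pattern_ws = re.compile(
    r'(?<!\d)(?=\d)(?:(?P<day>\d\d?)\.)?(?:(?P<month>\d\d?)\.)?'
    r'(?:\s?(?P<year>\d{4}))?(?!\d)'
)

def formatter(m):
    # Same idea as above: zero-pad each captured part and join year-first
    formats = {'year': 4, 'month': 2, 'day': 2}
    parts = [m.group(g).zfill(formats[g]) for g in formats if m.group(g)]
    return '-'.join(parts)

def normalize_dates(line, first_only=False):
    # In re.sub, count=0 means "replace every match"
    return pattern_ws.sub(formatter, line, count=1 if first_only else 0)

print(normalize_dates("ein Anton Alperstätt 20.9. 1748 bacc."))
print(normalize_dates("12.1. verteidigt unter v. Haaren"))
print(normalize_dates("7.1.1695 und 9.1.1768", first_only=True))
```

The `first_only` switch is there because the question mentions that handling only the first date per line would be acceptable; leaving it at `False` keeps the replace-all behaviour of the original loop.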