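I have a series of text blocks containing dates written as "The first Wednesday of September, 2021" or "The third Monday in July, 2022", and so on. I'm not sure of the best way to extract that text and reformat it as a standard "Month day, year" date. I tried a date finder library with fuzzy matching enabled, but it fails on "first Tuesday" and similar phrases, I believe because that is not a normal date format. Any ideas would be greatly appreciated. Thanks, everyone!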
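Assuming every date in the text follows the format
The cardinal day_of_week of Month, Year
(you would have to replace "in" with "of" in the second date), something like this should work: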
import calendar
import re

text = [
    "The first Wednesday of September, 2021",
    "The third Monday of July, 2022",
    # more dates
]

pattern = r"The (\w+) (\w+) of (\w+), (\d{4})"

# Maps the ordinal word in the text to the occurrence number within the month.
cardinal = {
    "first": 1,
    "second": 2,
    "third": 3,
    "fourth": 4,
    "fifth": 5,
}
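
def find_nth_day_of_week(year_str, month_name, day_of_week, n_str):
    year = int(year_str)
    try:
        # list.index raises ValueError for an unknown month or weekday name.
        month = list(calendar.month_name).index(month_name.capitalize())
        day_index = list(calendar.day_name).index(day_of_week.capitalize())
    except ValueError:
        return None
    if month == 0:  # calendar.month_name[0] is the empty string
        return None
    n = cardinal.get(n_str.lower())
    if n is None:
        return None
    # Each week is a list of 7 day numbers; 0 marks days outside the month.
    cal = calendar.monthcalendar(year, month)
    nth_occurrence = [week[day_index] for week in cal if week[day_index] != 0]
    if n > len(nth_occurrence):  # e.g. no fifth Monday in this month
        return None
    day = nth_occurrence[n - 1]
    # month_abbr gives "Sep"; use calendar.month_name[month] for the full name.
    return f"{calendar.month_abbr[month]} {day}, {year}"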
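
def parse_text(text):
    # re.match only matches at the start of the string.
    match = re.match(pattern, text)
    if match:
        # n_str avoids shadowing the module-level cardinal dict.
        n_str, day_of_week, month, year = match.groups()
        return find_nth_day_of_week(year, month, day_of_week, n_str)
    return None
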
dates = [parse_text(block) for block in text]
for i, date in enumerate(dates):
    print(f"Date {i + 1}: {date}")
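
If the phrases are buried inside longer blocks of text, or some of them use "in" rather than "of", a variation along the following lines might save the manual replacement. This is only a sketch under a few assumptions: the names NTH_WEEKDAY, ORDINALS, WEEKDAYS, MONTHS and extract_dates are made up for this example, the comma before the year is treated as optional, and the month and weekday names come from the calendar module, so they are English only under the default locale.

```python
import calendar
import re

# Hypothetical pattern: ordinal word, weekday, "of" or "in", month,
# optional comma, four-digit year. It can match anywhere in a block.
NTH_WEEKDAY = re.compile(
    r"\b(first|second|third|fourth|fifth)\s+(\w+)\s+(?:of|in)\s+(\w+),?\s+(\d{4})",
    re.IGNORECASE,
)
ORDINALS = {"first": 1, "second": 2, "third": 3, "fourth": 4, "fifth": 5}
# Lookup tables built from the calendar module (Monday is index 0).
WEEKDAYS = {name.lower(): i for i, name in enumerate(calendar.day_name)}
MONTHS = {name.lower(): i for i, name in enumerate(calendar.month_name) if name}


def extract_dates(block):
    """Return every nth-weekday phrase found in block as 'Month day, year'."""
    results = []
    for ordinal, weekday, month_name, year in NTH_WEEKDAY.findall(block):
        month = MONTHS.get(month_name.lower())
        day_index = WEEKDAYS.get(weekday.lower())
        if month is None or day_index is None:
            continue  # not a real month or weekday name
        weeks = calendar.monthcalendar(int(year), month)
        days = [week[day_index] for week in weeks if week[day_index] != 0]
        n = ORDINALS[ordinal.lower()]
        if n <= len(days):  # skip e.g. a fifth Monday that does not exist
            results.append(f"{calendar.month_name[month]} {days[n - 1]}, {year}")
    return results


sample = "Meet on the first Wednesday of September, 2021 and the third Monday in July 2022."
print(extract_dates(sample))
```

Here re.findall is used instead of re.match so that several dates can be picked out of a single block, and phrases that do not resolve to a real date are skipped rather than returned as None.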