Extracting from multiple regex named groups


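I have a dataframe whose text column contains dates in several formats. I have written a regular expression for each format. Each regex runs fine on its own, but when I try to run all of them against the dataframe at once, I keep getting the error "re.error: redefinition of group name 'month' as group 4; was group 1 at position 66".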

import pandas as pd

d = [{'text': '03/25/93 Total time of visit (in minutes):'},
     {'text': 'April 11, 1990 CPT Code: 90791: No medical services'},
     {'text': '29 Jan 1994 Primary Care Doctor:'},
     {'text': 's1981  Swedish-American Hospital'}]
mdf = pd.DataFrame(d, index=[1, 2, 3, 4])

regexpattern1 = r'(?P<month>\b\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b'
regexpattern2 = r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[.]?[a-z]*(?:,|\s|\-)?(?P<day>\d{2})(?:\-|,|\s)? (?P<year>\d{4})'
regexpattern3 = r'(?P<day>\d{2}) (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[.]?[a-z]*[,]? (?P<year>\d{4})'
regexpattern4 = r'(?P<month>)(?P<day>)\b[a-zA-Z]+(?P<year>\d{4})'
# mdf[['month', 'day', 'year']] = mdf['text'].str.extract(regexpattern4)  # runs individually
mdf[['month', 'day', 'year']] = mdf['text'].str.extract("|".join([regexpattern1, regexpattern2, regexpattern3, regexpattern4]))  # raises error
print(mdf)

Expected Output:
                                                  text  month  day  year
1           03/25/93 Total time of visit (in minutes):     03   25    93
2  April 11, 1990 CPT Code: 90791: No medical services    Apr   11  1990
3                     29 Jan 1994 Primary Care Doctor:    Jan   29  1994
4                     s1981  Swedish-American Hospital    NaN  NaN  1981
regex pandas date-parsing re
1 Answer

A solution using datefinder:

import datefinder
import numpy as np
import pandas as pd

string = """
03/25/93 Total time of visit (in minutes):
April 11, 1990 CPT Code: 90791: No medical services
29 Jan 1994 Primary Care Doctor:
s1981  Swedish-American Hospital
"""

result = []
# Skip the empty lines produced by the leading and trailing newlines.
loop = (line for line in string.split("\n") if line)
for line in loop:
    try:
        # Keep only the first date that datefinder finds on the line.
        date = next(m for m in datefinder.find_dates(line))
    except StopIteration:
        # No date was found on this line.
        date = np.nan

    result.append([line, date])

df = pd.DataFrame.from_records(result, columns=["text", "date"])
print(df)

This yields:

                                                text       date
0         03/25/93 Total time of visit (in minutes): 1993-03-25
1  April 11, 1990 CPT Code: 90791: No medical ser... 1990-04-11
2                   29 Jan 1994 Primary Care Doctor: 1994-01-29
3                   s1981  Swedish-American Hospital        NaT

Your original approach ran into two problems; what you are actually looking for is …
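If you would rather stay with the named-group regexes, one possible workaround is to skip the single joined pattern and try each pattern in turn, since Python's `re` module rejects a pattern that defines the same group name twice. The following is only a minimal sketch under some assumptions: `MONTHS`, `FIELDS` and `extract_date_parts` are hypothetical names introduced here, and the fourth pattern is reduced to its `year` group so that month and day come out missing rather than as empty strings.

```python
import re
import pandas as pd

d = [{'text': '03/25/93 Total time of visit (in minutes):'},
     {'text': 'April 11, 1990 CPT Code: 90791: No medical services'},
     {'text': '29 Jan 1994 Primary Care Doctor:'},
     {'text': 's1981  Swedish-American Hospital'}]
mdf = pd.DataFrame(d, index=[1, 2, 3, 4])

# Hypothetical helper names, not part of the original question.
MONTHS = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
FIELDS = ('month', 'day', 'year')

patterns = [
    r'(?P<month>\b\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b',
    r'(?P<month>' + MONTHS + r')[.]?[a-z]*(?:,|\s|-)?(?P<day>\d{2})(?:-|,|\s)? (?P<year>\d{4})',
    r'(?P<day>\d{2}) (?P<month>' + MONTHS + r')[.]?[a-z]*[,]? (?P<year>\d{4})',
    r'\b[a-zA-Z]+(?P<year>\d{4})',  # assumption: only the year group is kept here
]

# Joining the patterns with "|" repeats the group names, which `re` refuses to compile.
try:
    re.compile('|'.join(patterns))
    duplicate_names_rejected = False
except re.error:
    duplicate_names_rejected = True

compiled = [re.compile(p) for p in patterns]

def extract_date_parts(text):
    """Return month/day/year from the first pattern that matches, else all None."""
    for pattern in compiled:
        match = pattern.search(text)
        if match:
            found = match.groupdict()
            # Groups a pattern does not define are reported as None.
            return {field: found.get(field) for field in FIELDS}
    return dict.fromkeys(FIELDS)

parts = pd.DataFrame([extract_date_parts(t) for t in mdf['text']], index=mdf.index)
mdf = mdf.join(parts)
print(mdf)
```

The order of `patterns` matters in this sketch, because the first pattern that matches a row wins and the later ones are never tried for that row.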
