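I have a dataframe whose text column contains dates in several formats. I have written a regular expression for each format. Each regex runs fine on its own, but when I try to run all of them against the dataframe at once, I keep getting the error "re.error: redefinition of group name 'month' as group 4; was group 1 at position 66".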
import pandas as pd

d = [{'text': '03/25/93 Total time of visit (in minutes):'},
     {'text': 'April 11, 1990 CPT Code: 90791: No medical services'},
     {'text': '29 Jan 1994 Primary Care Doctor:'},
     {'text': 's1981 Swedish-American Hospital'}]
mdf = pd.DataFrame(d, index=[1,2,3,4])
regexpattern1 = r'(?P<month>\b\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b'
regexpattern2 = r'(?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[.]?[a-z]*(?:,|\s|\-)?(?P<day>\d{2})(?:\-|,|\s)? (?P<year>\d{4})'
regexpattern3 = r'(?P<day>\d{2}) (?P<month>(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec))[.]?[a-z]*[,]? (?P<year>\d{4})'
regexpattern4 = r'(?P<month>)(?P<day>)\b[a-zA-Z]+(?P<year>\d{4})'
# mdf[['month', 'day', 'year']] = mdf['text'].str.extract(regexpattern4) # runs individually
mdf[['month', 'day', 'year']] = mdf['text'].str.extract("|".join([regexpattern1, regexpattern2, regexpattern3, regexpattern4])) # raises error
print(mdf)
Expected output:
   text                                                 month  day  year
1  03/25/93 Total time of visit (in minutes):           03     25   93
2  April 11, 1990 CPT Code: 90791: No medical services  Apr    11   1990
3  29 Jan 1994 Primary Care Doctor:                     Jan    29   1994
4  s1981 Swedish-American Hospital                      NaN    NaN  1981
A solution using datefinder:
import datefinder
import numpy as np
import pandas as pd

string = """
03/25/93 Total time of visit (in minutes):
April 11, 1990 CPT Code: 90791: No medical services
29 Jan 1994 Primary Care Doctor:
s1981 Swedish-American Hospital
"""

result = []
# Skip the empty strings produced by the leading and trailing newlines.
loop = (line for line in string.split("\n") if line)
for line in loop:
    try:
        # Keep only the first date datefinder reports for this line.
        date = next(m for m in datefinder.find_dates(line))
    except Exception:
        # No date was found (or the line could not be parsed).
        date = np.nan
    result.append([line, date])

df = pd.DataFrame.from_records(result, columns=["text", "date"])
print(df)
This yields:
   text                                               date
0  03/25/93 Total time of visit (in minutes):         1993-03-25
1  April 11, 1990 CPT Code: 90791: No medical ser...  1990-04-11
2  29 Jan 1994 Primary Care Doctor:                   1994-01-29
3  s1981 Swedish-American Hospital                    NaT

Your original approach ran into two problems; what you are actually looking for is ...
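As for the error in the question: Python's re module does not allow the same group name to be defined twice in one pattern, and joining the four patterns with "|" does exactly that. If you would rather keep the month/day/year columns than switch to datefinder, one possible workaround is to try the patterns one at a time and keep the first that matches. The following is only a minimal sketch under that assumption; the names MONTHS, PATTERNS, COLUMNS and split_date are made up for illustration, and the patterns are slightly simplified versions of the ones in the question, not the exact originals.

```python
import re

import pandas as pd

mdf = pd.DataFrame(
    {"text": [
        "03/25/93 Total time of visit (in minutes):",
        "April 11, 1990 CPT Code: 90791: No medical services",
        "29 Jan 1994 Primary Care Doctor:",
        "s1981 Swedish-American Hospital",
    ]},
    index=[1, 2, 3, 4],
)

COLUMNS = ["month", "day", "year"]
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# Each pattern is compiled separately, so reusing the group names is fine.
# Order matters: the most specific formats come first.
PATTERNS = [
    re.compile(r"\b(?P<month>\d{1,2})[/-](?P<day>\d{1,2})[/-](?P<year>\d{2})\b"),
    re.compile(rf"(?P<month>{MONTHS})[a-z]*\.?,?\s+(?P<day>\d{{1,2}}),?\s+(?P<year>\d{{4}})"),
    re.compile(rf"(?P<day>\d{{1,2}})\s+(?P<month>{MONTHS})[a-z]*\.?,?\s+(?P<year>\d{{4}})"),
    re.compile(r"\b[a-zA-Z]+(?P<year>\d{4})"),  # year only, e.g. "s1981"
]


def split_date(text):
    """Return month/day/year from the first pattern that matches, else all None."""
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            groups = match.groupdict()
            # Groups a pattern does not define (e.g. month for "s1981") stay None.
            return {col: groups.get(col) for col in COLUMNS}
    return dict.fromkeys(COLUMNS)


parts = pd.DataFrame(
    [split_date(text) for text in mdf["text"]],
    index=mdf.index,
    columns=COLUMNS,
)
print(mdf.join(parts))
```

Because the first matching pattern wins for a whole row, the values of one row always come from a single format; the year-only pattern is only reached when none of the full formats match.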