I have a string like this:
"The dates are from 30 June 2019 to 1 January 2022 inclusive"
I want to extract the dates from this string using spaCy.
Here is my function so far:
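import spacy

# Assumption: the small English pipeline; any English model with an NER component should work
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year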
This returns the following output:
['30 June 2019 to 1 January 2022']
However, I want output like this:
['30 June 2019', '1 January 2022']
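The problem is that "to" is treated as part of the date. So when you run for ent in doc.ents, the loop performs only one iteration, because "30 June 2019 to 1 January 2022" is recognized as a single entity.
Since you don't want this behaviour, you can modify the function to split each entity on "to":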
import re

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # Split on "to" as a whole word so month names like "October" stay intact
            for ent_txt in re.split(r"\bto\b", ent.text):
                dates_with_year.append(ent_txt.strip())
    return dates_with_year
This handles date ranges like the one above correctly, as well as single dates and strings containing several separate dates:
# Create some strings
txt_list = [
    "The dates are from 30 June 2019 to 1 January 2022 inclusive",
    "It was 4 August 2021",
    "We were gone from 21 April 2019 until 21 May 2020"
]
# Run the new function
[extract_dates_with_year(txt) for txt in txt_list]
# Output:
[
    ['30 June 2019', '1 January 2022'],
    ['4 August 2021'],
    ['21 April 2019', '21 May 2020']
]
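
One thing to watch when splitting entity text on "to": month names such as "October" contain those two letters, so a plain substring split would cut them in half. The following is a minimal sketch of the splitting step on its own, using only the standard library so it can be tried without loading a spaCy model. The helper name split_date_range is hypothetical and not part of spaCy, and it assumes the separator always appears as the standalone word "to" surrounded by whitespace.

```python
import re

# Hypothetical helper (not part of spaCy): splits the text of one DATE entity
# on the standalone word "to". Assumes "to" is surrounded by whitespace.
_RANGE_SEPARATOR = re.compile(r"\s+to\s+", flags=re.IGNORECASE)

def split_date_range(entity_text):
    """Return the individual date strings found in one DATE entity's text."""
    parts = _RANGE_SEPARATOR.split(entity_text)
    # Drop empty fragments and surrounding whitespace
    return [part.strip() for part in parts if part.strip()]

print(split_date_range("30 June 2019 to 1 January 2022"))
# ['30 June 2019', '1 January 2022']
print(split_date_range("1 October 2019 to 3 October 2020"))
# ['1 October 2019', '3 October 2020']

# For comparison, a plain substring split breaks "October":
print("1 October 2019".split("to"))
# ['1 Oc', 'ber 2019']
```

Inside extract_dates_with_year, a helper like this could be called on ent.text for each DATE entity. Other range words such as "until" or a dash would need to be added to the pattern if spaCy ever merges those ranges into one entity as well.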