Extracting dates from a sentence in spaCy


I have a string like this:

"The dates are from 30 June 2019 to 1 January 2022 inclusive"

I want to extract the dates from this string using spaCy.

Here is my function so far:

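import spacy

# Assumption: the small English model is installed; any English pipeline with NER should work
nlp = spacy.load("en_core_web_sm")

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        # Keep only the entities that spaCy labels as dates
        if ent.label_ == "DATE":
            dates_with_year.append(ent.text)
    return dates_with_year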

This returns the following output:

['30 June 2019 to 1 January 2022']

However, I want output like this:

['30 June 2019', '1 January 2022']
python regex nlp spacy named-entity-recognition
1 Answer

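The problem is that "to" is treated as part of the date. As a result, when you run for ent in doc.ents, the loop iterates only once, because "30 June 2019 to 1 January 2022" is recognized as a single entity.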

Since you don't want this behavior, you can modify the function to split each entity's text on the word "to":

def extract_dates_with_year(text):
    doc = nlp(text)
    dates_with_year = []
    for ent in doc.ents:
        if ent.label_ == "DATE":
            # Split on " to " (with spaces) so month names such as "October" stay intact
            for ent_txt in ent.text.split(" to "):
                dates_with_year.append(ent_txt.strip())
    return dates_with_year

This correctly handles a date range like the one above, as well as single dates and strings containing several dates:

# Create some strings
txt_list = [
     "The dates are from 30 June 2019 to 1 January 2022 inclusive",
     "It was 4 August 2021",
     "We were gone from 21 April 2019 until 21 May 2020"
]

# Run the new function
[extract_dates_with_year(txt) for txt in txt_list]

# Output:
[
    ['30 June 2019', '1 January 2022'],
    ['4 August 2021'],
    ['21 April 2019', '21 May 2020']
]
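
Splitting on a bare substring is fragile: "to" also occurs inside month names such as "October", and a model may keep other connectors such as "until" inside one entity. A slightly more defensive variant, sketched below, splits only on whole connector words. The helper name split_date_range and the list of connectors are assumptions for illustration, not part of spaCy, and the helper works on plain strings so it can be tried without loading a model.

```python
import re

# Hypothetical helper, not part of spaCy. The connector list is an assumption;
# extend it to match the phrasing in your own data.
_RANGE_CONNECTOR = re.compile(r"\s+(?:to|until|through)\s+", flags=re.IGNORECASE)


def split_date_range(entity_text):
    """Split one DATE entity's text into the individual date strings it contains."""
    # The connector must be surrounded by whitespace, so the "to" inside
    # "October" is never treated as a separator.
    parts = _RANGE_CONNECTOR.split(entity_text)
    return [part.strip() for part in parts if part.strip()]


print(split_date_range("30 June 2019 to 1 January 2022"))
print(split_date_range("1 October 2021 until 3 October 2022"))
print(split_date_range("4 August 2021"))
```

Inside extract_dates_with_year, the inner loop could then be replaced with dates_with_year.extend(split_date_range(ent.text)).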