Is there a way to programmatically extract terms (contract length) from free text?


I want to extract the contract length from free text as a term (in months). The free-text field ranges over values such as:

"2 x 5 year terms",
"3 further  x 4 years",
"two(2) further terms of five(5) years each",
"Two (2) Years + Two (2) Years + Two (2) Years",
"1 years + 1 years + 1 years",
"2 x 3 years",
"1 year and 6 months",

I would like the output to be:

120 months,
144 months,
120 months,
72 months,
36 months,
72 months,
18 months

My attempt so far:

import re

def calculate_duration(term):
    term = term.lower()

    # Handle "x year terms" pattern
    match = re.match(r'(\d+) x (\d+) year terms?', term)
    if match:
        return int(match.group(1)) * int(match.group(2)) * 12
    
    # Handle "FURTHER TERMS OF x YEARS EACH" pattern
    match = re.match(r'further terms of (\d+) years each', term)
    if match:
        return int(match.group(1)) * 12


    # Handle "FURTHER TERMS OF x YEARS EACH" pattern
    match = re.match(r'further terms of ((?:\d+\s?\(\w+\)\s?)?(\d+)) years each', term)
    if match:
        return int(match.group(2)) * 12

    # Handle "x years + x years + x years" pattern
    match = re.match(r'(\d+) years(\s?\+\s?\d+ years)+', term)
    if match:
        return sum(int(match.group(1)) for group in match.groups()) * 12

    # Handle other patterns or simple year counts
    match = re.match(r'(\d+) years?', term)
    if match:
        return int(match.group(1)) * 12

    # Handle other cases or unknown patterns
    return None

# Example usage
terms = [
    "2 x 5 year terms",
    "3 further x 4 YEAR terms",
    "Two (2) Years + Two (2) Years + Two (2) Years",
    "1 years + 1 years + 1 years" ,
    "2 x 3 years"
]

for term in terms:
    duration = calculate_duration(term)
    print(f"{term}: {duration} months")

python regex nltk spacy text-extraction
1 Answer

"...I want to extract the contract length from free text as a term (in months). ..."

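Make use of the built-in eval function.

Traverse the text, appending the corresponding values: numbers and operators.
When a "year" value is encountered, adjust the preceding value accordingly by multiplying it by 12.

From there, generate a mathematical expression by joining the values.

Here is an example.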

import re

def parse(s: str):
    """Turn a free-text term into a list of month counts and operators."""
    e = []
    for x in s.split():
        if any(c.isdigit() for c in x):
            # Keep only the digits, e.g. 'two(2)' -> 2
            e.append(int(re.sub(r'\D', '', x)))
        elif 'year' in x.lower():
            # Convert the preceding number from years to months
            e[-1] *= 12
        elif x in ['x', 'of']:
            e.append('*')
        elif x in ['+', 'and']:
            e.append('+')
    return e

text = ['2 x 5 year terms',
        '3 further x 4 years',
        'two(2) further terms of five(5) years each',
        'Two (2) Years + Two (2) Years + Two (2) Years',
        '1 years + 1 years + 1 years',
        '2 x 3 years',
        '1 year and 6 months']
for string in text:
    # The joined expression contains only integers, '*' and '+'
    exp = ' '.join(map(str, parse(string)))
    print(exp, '=', eval(exp))

Output

2 * 60 = 120
3 * 48 = 144
2 * 60 = 120
24 + 24 + 24 = 72
12 + 12 + 12 = 36
2 * 36 = 72
12 + 6 = 18
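
If calling eval on text-derived input is a concern, the same token list can be evaluated by hand. The following is only a sketch under a few assumptions that are not part of the answer above: the helper names tokens and total_months are hypothetical, words are lower-cased before matching (so "X" and "AND" also count as operators), and a term that does not reduce to a well-formed expression yields None instead of raising.

```python
import re

def tokens(s: str) -> list:
    """Same idea as parse() above: month counts plus '*' and '+' operators."""
    out = []
    for word in s.lower().split():
        if any(c.isdigit() for c in word):
            # Keep only the digits, e.g. 'five(5)' -> 5
            out.append(int(re.sub(r'\D', '', word)))
        elif 'year' in word and out and isinstance(out[-1], int):
            # Convert the preceding number from years to months
            out[-1] *= 12
        elif word in ('x', 'of'):
            out.append('*')
        elif word in ('+', 'and'):
            out.append('+')
    return out

def total_months(s: str):
    """Evaluate the tokens with '*' binding tighter than '+'; None if malformed."""
    toks = tokens(s)
    if not toks:
        return None
    total, product, expect_number = 0, 1, True
    for t in toks:
        if expect_number:
            if not isinstance(t, int):
                return None          # operator where a number was expected
            product *= t
        elif t == '+':
            total += product         # close the current product
            product = 1
        elif t != '*':
            return None              # two numbers in a row
        expect_number = not expect_number
    if expect_number:
        return None                  # expression ended on an operator
    return total + product

for term in ['2 x 5 year terms', '1 year and 6 months']:
    print(term, '->', total_months(term), 'months')
```

This returns a plain integer number of months, which matches the format requested in the question, and it never executes text from the input.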