Python: Find the complete text after a specific word in a string using RegEx only

Question  Votes: 0  Answers: 5

The text is as follows:

text = list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment 
slip goods receipt note iz checklist creator id name 30009460 [email protected]
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated 
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka 
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order 
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no 
vill bitta ta naliya abadasa despatched through destination march 18 terms of

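Goal: I want to extract the text that follows the word "invoice", in particular the text after the second occurrence of "invoice".

My approach: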

txt = re.findall('invoice (.*)',text)

With the approach above, I expected a list of strings like this:

txt = ['in favour of company z 02 cjpc abstract sheet weighment 
    slip goods receipt note iz checklist creator id name 30009460 [email protected]
    checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything 
    written manually on the checklist will not be considered','parth enterprise â invoice no dated 
    kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment 
    taluka ..... #rest of the string]

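But what I get is the entire string given in text, i.e. the original string. If I use text.partition('invoice'), I do not get the correct strings shown in txt either.

Any help would be greatly appreciated.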


python regex
5 Answers
1
Votes
If you want to get 2 matches as in the question, you can use 2 capture groups.
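
The answer's own code did not survive on this page, so the following is only a minimal sketch of what a two-capture-group pattern could look like. The `sample` string is a shortened stand-in for the question's text, and the exact pattern is an assumption rather than the answerer's original.

```python
import re

# Shortened stand-in for the question's text (an assumption, not the full input).
sample = (
    "check 01 original invoice in favour of company z will not be "
    "considered invoice parth enterprise invoice no dated 25 mar 2018"
)

# Group 1 (lazy): text between the first and the second "invoice".
# Group 2 (greedy): everything after the second "invoice", to the end.
# re.DOTALL lets "." cross line breaks in case the real text has them.
pattern = re.compile(r"\binvoice\b\s*(.*?)\s*\binvoice\b\s*(.*)", re.DOTALL)

m = pattern.search(sample)
groups = list(m.groups()) if m else []
print(groups)
```

The first group is lazy so that it stops at the next "invoice"; the second is greedy so that later occurrences of "invoice" stay inside the captured remainder, as in the list the question asks for.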

0
Votes
This can be done easily with the split() method, for example:
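
The example that followed this answer is missing here. A minimal sketch of the idea, assuming a shortened stand-in `sample` string and a `maxsplit` of 2 so that the text after the second "invoice" stays in one piece:

```python
# Shortened stand-in for the question's text (an assumption, not the full input).
sample = (
    "check 01 original invoice in favour of company z will not be "
    "considered invoice parth enterprise invoice no dated 25 mar 2018"
)

# Split on the first two occurrences only: parts[0] is the text before the
# first "invoice", parts[1] the text between the first and second, and
# parts[2] everything after the second occurrence.
parts = sample.split("invoice", 2)

# Drop the leading part and trim the spaces left around each piece.
result = [p.strip() for p in parts[1:]]
print(result)
```

Note that `str.split` matches the plain substring, so it would also split inside a longer word such as "invoices"; a regex with `\b` is needed if that matters.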

0
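Votes
Your regex invoice (.*) matches the first literal invoice followed by a space, and then (.*) greedily captures all of the remaining text into group 1. That is the expected, correct behavior.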
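
To make the greedy behavior concrete, here is a small sketch on a shortened stand-in `sample` string (an assumption, not the question's full text). The lazy variant with a lookahead is one possible alternative and is not taken from the answer.

```python
import re

# Shortened stand-in for the question's text (an assumption, not the full input).
sample = (
    "check 01 original invoice in favour of company z will not be "
    "considered invoice parth enterprise invoice no dated 25 mar 2018"
)

# Greedy: (.*) runs to the end of the line, so findall returns a single
# item that still contains the later occurrences of "invoice".
greedy = re.findall(r"invoice (.*)", sample)

# Lazy with a lookahead: (.*?) stops just before the next " invoice "
# (or at the end of the string), so each occurrence yields its own item.
lazy = re.findall(r"invoice (.*?)(?= invoice |$)", sample)

print(greedy)
print(lazy)
```

The lazy form breaks the text at every "invoice", which differs from the list in the question, where the second item keeps the rest of the string.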

0
Votes

Update


0
Votes
This can be solved more efficiently by splitting the input with a simpler regex:

import re

text = r"""list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 [email protected] checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered invoice parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no vill bitta ta naliya abadasa despatched through destination march 18 terms of"""

# matches = re.split(r'\b\s*invoice\s*\b', text)[1:-1]  # if arbitrary white space can come before and after "invoice"
matches = re.split(r'\b ?invoice ?\b', text)[1:-1]

for i, match in enumerate(matches):
    print(f'\nMatch {i + 1}:\n', match, sep='')
