The text is as follows:
text = list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 [email protected]
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of
Goal: I want to extract the text that follows the word "invoice", in particular the text after its second occurrence.
My approach:
txt = re.findall('invoice (.*)',text)
With the approach above, I expected a list of strings like this:
txt = ['in favour of company z 02 cjpc abstract sheet weighment slip goods receipt note iz checklist creator id name 30009460 [email protected] checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything written manually on the checklist will not be considered','parth enterprise â invoice no dated kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka ..... #rest of the string]
But what I get is the entire string given in text, i.e. the original string. If I use text.partition('invoice'), I cannot get the strings shown in txt either.
Any help would be greatly appreciated.
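The pattern invoice (.*) matches the first literal invoice followed by a space, and then (.*) greedily captures the rest of the text into group 1 (up to the end of the line, since . does not match a newline by default). That is the expected, correct behavior.
Update: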
import re
text= r"""list of documents check 01 original invoice in favour of company z 02 cjpc abstract sheet weighment
slip goods receipt note iz checklist creator id name 30009460 [email protected]
checklist creation date 31 03 2018 checklist print date time 31 03 2018 10 45 57 note anything
written manually on the checklist will not be considered invoice parth enterprise â invoice no dated
kashish aarcade baroi road 18 25 mar 2018 village baroi delivery note mode terms of payment taluka
mundra kutch supplierâ s ref other reference s gst no 24acypt3861 c1 z 7 dated buyer i buyer s order
no 21 jun 2017 abc corporation 5700214006 â dated 40 mwp solar power plant i despatch document no
vill bitta ta naliya abadasa despatched through destination march 18 terms of"""
# If arbitrary whitespace can come before and after "invoice", use this instead:
# matches = re.split(r'\b\s*invoice\s*\b', text)[1:-1]
# Split on every "invoice"; [1:-1] drops the text before the first occurrence and after the last one.
matches = re.split(r'\b ?invoice ?\b', text)[1:-1]
for i, match in enumerate(matches):
    print(f'\nMatch {i + 1}:\n', match, sep='')
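
If the goal is the exact list described in the question (the text between the first and second invoice, then everything after the second one, including any later invoice), one possible sketch is to cap the number of splits with maxsplit. The helper name text_after_invoice and the shortened sample string are made up for illustration and are not from the original post.

```python
import re

# Shortened, hypothetical stand-in for the OCR text in the question.
sample = (
    "list of documents check 01 original invoice in favour of company z "
    "will not be considered invoice parth enterprise invoice no dated "
    "kashish aarcade"
)

# Greedy (.*) starts after the first "invoice " and runs to the end of the line,
# so findall returns one long string rather than one item per occurrence.
greedy = re.findall(r"invoice (.*)", sample)


def text_after_invoice(text, keyword="invoice"):
    """Return the text after the first and after the second occurrence of keyword.

    maxsplit=2 splits only at the first two occurrences, so any later
    occurrence stays inside the last piece.
    """
    pattern = rf"\s*\b{re.escape(keyword)}\b\s*"
    parts = re.split(pattern, text, maxsplit=2)
    # parts[0] is whatever precedes the first occurrence; drop it.
    return parts[1:]


for i, piece in enumerate(text_after_invoice(sample), start=1):
    print(f"After occurrence {i}: {piece}")
```

Because \s* also matches newlines, the same call should work on the multi-line text above. If the keyword occurs fewer than two times, the returned list is simply shorter (empty when it does not occur at all).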