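I have a list of strings. If any word from the list matches within a line of a document, I want to output the matched word together with the number that appears on that line, usually right after the matched word. The word and the number are mostly separated by a space or a colon (:).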
Example line from the document:
Expedien: 1-21-212-16-26
My list:
my_list = ['Reference', 'Ref.', 'tramite', 'Expedien']
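The number on the matching line may or may not be separated by hyphens (-). Examples: 1-21-22-45 or RE9833.
In the second case, if a word from the list is found on the line, RE9833 should be returned in full (not only the digits).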
How can I write a regular expression for this in Python?
Input file:
$cat input_file
Expedien: 1-21-212-16-26 #other garbage
Reference RE9833 #tralala
abc
123
456
Ref.: UV1234
tramite 1234567
Ref.:
Sample code:
import re
my_list = ['Reference', 'Ref.', 'tramite', 'Expedien']
# open the file as input
with open('input_file', 'r') as infile:
    # create an empty dict to store the key/value pairs
    # that we will extract from the file
    res = dict()
    # for each input line
    for line in infile:
        # the only place we use a regex in this code:
        # split the line into a list of strings, using as separator
        # an optional ':' followed by one or more whitespace characters
        elems = re.split(r'(?::)?\s+', line)
        # we need at least 2 elements (a key and a value);
        # otherwise we move on to the next line
        if len(elems) >= 2:
            contains = False
            # tmp holds the most recently identified key
            tmp = ''
            # go through all the strings in this list
            for elem in elems:
                # if we get here with contains set, the key was found on the
                # previous iteration and elem is its value
                if contains:
                    # store the pair in the dict,
                    # reset the flag and leave this loop
                    res.update({tmp: elem})
                    contains = False
                    break
                # check whether elem is one of the words in my_list
                if elem in my_list:
                    # if so, set contains to True and remember the key in tmp
                    contains = True
                    tmp = elem
print(res)
Output:
python find_list.py
{'tramite': '1234567', 'Reference': 'RE9833', 'Expedien': '1-21-212-16-26', 'Ref.': ''}
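Note that 'Ref.' ends up mapped to an empty string: the last line of the input (Ref.: with no value) splits into ['Ref.', ''] and overwrites the UV1234 stored earlier. If that is not what you want, here is a minimal sketch of a variant that skips empty values and keeps every value seen per key. The function name collect_values and the choice of a list per key are my own assumptions, not part of the code above.

```python
import re
from collections import defaultdict

my_list = ['Reference', 'Ref.', 'tramite', 'Expedien']

def collect_values(lines, keywords):
    """Map each keyword to the list of non-empty values that follow it."""
    found = defaultdict(list)
    for line in lines:
        # Strip the trailing newline first so a bare "Ref.:" line
        # does not produce an empty trailing element.
        elems = re.split(r':?\s+', line.strip())
        # Walk over adjacent (candidate key, candidate value) pairs.
        for key, value in zip(elems, elems[1:]):
            if key in keywords and value:
                found[key].append(value)
                break  # only the first keyword on a line is used
    return dict(found)

sample_lines = [
    'Expedien: 1-21-212-16-26 #other garbage',
    'Reference RE9833 #tralala',
    'abc',
    'Ref.: UV1234',
    'tramite 1234567',
    'Ref.:',
]
print(collect_values(sample_lines, my_list))
```

With a real file you would pass the open file object instead of sample_lines, since iterating over it yields lines in the same way.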
Regex demo: https://regex101.com/r/kSmLzW/3/
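Since the question asks for a regular expression, here is a minimal sketch of a single-pattern approach. It is not necessarily the pattern used in the demo linked above. It assumes that the keyword is followed by at least one space or colon, and that the value is an optional run of letters followed by digit groups joined by hyphens (so RE9833 and 1-21-212-16-26 both qualify). The names pattern and extract_pairs are hypothetical.

```python
import re

my_list = ['Reference', 'Ref.', 'tramite', 'Expedien']

# Escape each keyword so that the '.' in 'Ref.' is matched literally,
# and try longer keywords first in case one keyword is a prefix of another.
alternation = '|'.join(
    re.escape(word) for word in sorted(my_list, key=len, reverse=True)
)

# key:   one of the keywords
# sep:   one or more whitespace characters or colons (assumed separators)
# value: optional letters, then digits, then optional "-digits" groups
pattern = re.compile(
    rf'\b(?P<key>{alternation})[\s:]+(?P<value>[A-Za-z]*\d+(?:-\d+)*)'
)

def extract_pairs(lines):
    """Return (keyword, value) tuples for every line that contains both."""
    pairs = []
    for line in lines:
        match = pattern.search(line)
        if match:
            pairs.append((match.group('key'), match.group('value')))
    return pairs

sample_lines = [
    'Expedien: 1-21-212-16-26 #other garbage',
    'Reference RE9833 #tralala',
    'abc',
    '123',
    '456',
    'Ref.: UV1234',
    'tramite 1234567',
    'Ref.:',
]
for key, value in extract_pairs(sample_lines):
    print(key, value)
```

On the sample input this yields a pair for Expedien, Reference, Ref. and tramite, and nothing for the bare Ref.: line, because the pattern requires a value. If your values can take other shapes, the value group is the part to adjust.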