I am trying to parse the following text.
Input:
value1 @ M Temperature 1.30 ohmm @ 74 00 degF
value2 Q M. Temperature 1 70 ohmm @ 74.00 degF
value3 @ m Temperature 110 ohmm @ 74.00 degF
Expected output:
value1 = 1.30
value1 temp = 74.00 degF
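and so on.
I am able to extract the text from the file, but I am having trouble parsing it in a way that copes with the variations in the OCR output. For example, when "Temperature" shows up as "Temp", the parser should still pull out the expected values, and the same goes for the other discrepancies.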
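import re

# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'C:\Users\NthDS1\Documents\samp.txt', 'r') as f:
    data = list()
    group = dict()
    # key: everything before "Temperature"; value: the number right after it
    for key, value in re.findall(r'(.*)Temperature\s*([\dE+-.]+)', f.read()):
        if key in group:
            data.append(group)
            group = dict()
        group[key] = value
    data.append(group)
print(data)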
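You can try this:
import re
# Alternatives: the leading name, the number before " ohmm", the number before a trailing " degF"
pattern = r'^[a-zA-Z0-9]+|[\d.]+(?=\sohmm)|[\d.]+(?=\sdegF$)'
data = [dict(zip(['name', 'ohmm', 'degF'], re.findall(pattern, i.strip('\n')))) for i in open('filename.txt')]
Output: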
[{'name': 'value1', 'ohmm': '1.30', 'degF': '00'}, {'name': 'value2', 'ohmm': '70', 'degF': '74.00'}, {'name': 'value3', 'ohmm': '110', 'degF': '74.00'}]
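Note that the OCR-split numbers "74 00" and "1 70" come out as '00' and '70' in that output, because the lookaheads only see the digits directly in front of the unit. One possible tweak, sketched below, is to let a number contain either a dot or a single space between its digit groups and to normalize the space afterwards. This assumes the OCR only ever swaps the decimal point for one space; `NUMBER`, `LINE_RX` and `parse_line` are names made up for this illustration.

```python
import re

# Assumption: OCR sometimes turns the decimal point into a single space
# ("74 00" instead of "74.00"), so a number is digits, optionally followed
# by a dot or a space and more digits.
NUMBER = r'\d+(?:[. ]\d+)?'
LINE_RX = re.compile(
    r'^(?P<name>[a-zA-Z0-9]+).*?'
    r'(?P<ohmm>' + NUMBER + r')\s*ohmm.*?'
    r'(?P<degF>' + NUMBER + r')\s*degF\s*$'
)

def parse_line(line):
    """Return a dict with name, ohmm and degF, or None if the line does not match."""
    m = LINE_RX.search(line.strip())
    if m is None:
        return None
    row = m.groupdict()
    for key in ('ohmm', 'degF'):
        # Put the decimal point back where OCR produced a space
        row[key] = row[key].replace(' ', '.')
    return row

sample = [
    "value1 @ M Temperature 1.30 ohmm @ 74 00 degF",
    "value2 Q M. Temperature 1 70 ohmm @ 74.00 degF",
    "value3 @ m Temperature 110 ohmm @ 74.00 degF",
]
data = [parse_line(line) for line in sample]
print(data)
```

With the three sample lines this gives '1.30'/'74.00', '1.70'/'74.00' and '110'/'74.00'. Whether "110" was really meant to be "1.10" cannot be decided from the text alone, so the sketch leaves it untouched.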
If that is the only variation (a space where the decimal point should be), you could try
\b(?P<value>\d+(?:[. ]\d+)?)\b\s*
(?P<unit>\w+)
import re

data = """
value1 @ M Temperature 1.30 ohmm @ 74 00 degF
value2 Q M. Temperature 1 70 ohmm @ 74.00 degF
value3 @ m Temperature 110 ohmm @ 74.00 degF
"""

# A number (digits, optionally a dot or space plus more digits) followed by its unit
rx = re.compile(r'''
    \b(?P<value>\d+(?:[. ]\d+)?)\b\s*
    (?P<unit>\w+)''', re.X)

def afterwork(match):
    # Turn an OCR space back into a decimal point, then convert to float if possible
    value = match.replace(' ', '.')
    try:
        value = float(value)
    except ValueError:
        pass
    return value

values = [(afterwork(m.group('value')), m.group('unit'))
          for m in rx.finditer(data)]
print(values)
This produces:
[(1.3, 'ohmm'), (74.0, 'degF'), (1.7, 'ohmm'), (74.0, 'degF'), (110.0, 'ohmm'), (74.0, 'degF')]
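That flat list no longer says which line each pair came from. To get back to the format asked for in the question, one option is to run the same kind of pattern line by line and take the first token of each line as the name. The following is only a sketch under a few assumptions: the name is always the first whitespace-separated token, the units are exactly "ohmm" and "degF", and the helper name `report` is invented here. Values are kept as strings so that no precision is made up.

```python
import re

text = """
value1 @ M Temperature 1.30 ohmm @ 74 00 degF
value2 Q M. Temperature 1 70 ohmm @ 74.00 degF
value3 @ m Temperature 110 ohmm @ 74.00 degF
"""

# Same number pattern as above, but restricted to the two known units (assumption)
rx = re.compile(r'\b(?P<value>\d+(?:[. ]\d+)?)\s*(?P<unit>ohmm|degF)\b')

def report(text):
    """Build 'name = value' / 'name temp = value degF' lines for each input line."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name = line.split()[0]
        # Map each unit to its value, restoring the decimal point lost by OCR
        found = {m.group('unit'): m.group('value').replace(' ', '.')
                 for m in rx.finditer(line)}
        if 'ohmm' in found:
            out.append('{} = {}'.format(name, found['ohmm']))
        if 'degF' in found:
            out.append('{} temp = {} degF'.format(name, found['degF']))
    return out

result = report(text)
print('\n'.join(result))
```

For the sample input the first two lines printed are "value1 = 1.30" and "value1 temp = 74.00 degF", matching the expected output in the question.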