How can I extract and validate XML content from a log file in Python?


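I need to parse some log files whose content is XML-like, but there is no root element and plain text is interleaved with the XML.

The log file format is: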

2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373011</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>

2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373012</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>

Since this is a log file, can I still use the ElementTree library? I need to verify that Measured is Ok for each ItemId.

I tried the following, without success:

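import xml.etree.ElementTree as ET
import re

# Raw string so the backslashes in the Windows path are kept literally.
with open(r'C:\lovely\Libraries\site.log') as f:
    xml = f.read()
# Meant to open <root> after an XML declaration and close it at the end;
# the log has no <?xml ...?> declaration, so the substitution changes nothing.
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")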

Tags: xml, logging

Answers

0 votes

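It is probably not possible to parse a file in which plain text and arbitrary XML fragments are mixed together. The text portions may well contain things that look like XML but are not well-formed (such as <\?xml[^>]+\?>); in the general case, these cannot be reliably distinguished from the real XML.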
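
The point about XML-like text can be checked against the sample log itself. The following is a minimal sketch, not a diagnosis of every log: it wraps two lines shaped like the ones in the question in a synthetic root element and shows that the standard-library parser rejects them, because the `<ThreadPool>` token in the log prefix reads as an opening tag that is never closed. The helper name `try_parse_wrapped` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Two lines in the shape of the log from the question.
sample = (
    "2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:\n"
    "<Keepalive />\n"
)


def try_parse_wrapped(text):
    """Wrap text in a synthetic <root> element and report whether it parses.

    Hypothetical helper for illustration; returns (ok, error_message).
    """
    try:
        ET.fromstring("<root>" + text + "</root>")
    except ET.ParseError as exc:
        return False, str(exc)
    return True, ""


# The log prefix contains <ThreadPool>, which is opened but never closed.
ok, message = try_parse_wrapped(sample)
print(ok, message)

# With the log prefix removed, the same wrapping parses without error.
ok_clean, _ = try_parse_wrapped("<Keepalive />\n")
print(ok_clean)
```

So adding a root element is not enough on its own: the non-XML text has to be removed or skipped before a strict XML parser will accept the content.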


0 votes

Try this. It is highly fault-tolerant and treats the data as text.

from simplified_scrapy import SimplifiedDoc
html = '''
2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373011</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>

2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373012</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>
'''
doc = SimplifiedDoc(html)
# Outcome = doc.Outcome
Outcomes = doc.Outcomes 
print(Outcomes.ItemId.text, Outcomes.AreaId.text)

The result is:

['373011', '373012'] ['232', '232']

More examples are available here: https://github.com/yiyedata/simplified-scrapy-demo/tree/master/doc_examples
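
For the validation the question actually asks about (Measured is Ok for each ItemId), a standard-library alternative may also be enough. This is only a sketch under stated assumptions: the element name `Outcome` is known in advance, `<Outcome>` blocks are never nested, and the surrounding log text never contains the literal string `<Outcome>`. The names `OUTCOME_RE` and `measured_by_item` are hypothetical. Each block is cut out with a regular expression and then parsed on its own with ElementTree, so the log prefixes never reach the XML parser.

```python
import re
import xml.etree.ElementTree as ET

# Sample data copied from the question.
LOG_TEXT = """\
2019-09-12 15:30:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:30:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373011</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:30:02Z</TimeStamp>
</Outcome>

2019-09-12 15:32:02.137 (162,<ThreadPool>    ) Info          Sending:
<Keepalive />
2019-09-12 15:32:03.512 (65 ,Estate            ) DebugInfo     Incoming buffer has 292 bytes
<Outcome>
  <ItemId>373012</ItemId>
  <AreaId>232</AreaId>
  <CarrierId>131</CarrierId>
  <AResult>
    <Measured>Ok</Measured>
  </AResult>
    <TimeStamp>2019-09-12T19:32:02Z</TimeStamp>
</Outcome>
"""

# Non-greedy match from an opening <Outcome> to the nearest closing tag.
# Assumption: Outcome elements are not nested inside each other.
OUTCOME_RE = re.compile(r"<Outcome>.*?</Outcome>", re.DOTALL)


def measured_by_item(text):
    """Map each ItemId to True when its <Measured> value is exactly 'Ok'."""
    results = {}
    for match in OUTCOME_RE.finditer(text):
        try:
            outcome = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # skip blocks that are truncated or malformed
        item_id = outcome.findtext("ItemId")
        measured = outcome.findtext("AResult/Measured")
        if item_id is not None:
            results[item_id.strip()] = (measured or "").strip() == "Ok"
    return results


print(measured_by_item(LOG_TEXT))
# → {'373011': True, '373012': True}
```

To run this against a real file, read the file into a string first (for example with `open(path).read()`) and pass that string to `measured_by_item`. Blocks that fail to parse are silently skipped here; logging them instead may be preferable when truncated messages matter.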
