I have an HTML file that contains XML inside a comment at the bottom. It looks like this:
<!DOCTYPE html>
<html>
<head>
***
</head>
<body>
<div class="panel panel-primary call__report-modal-panel">
<div class="panel-heading text-center custom-panel-heading">
<h2>Report</h2>
</div>
<div class="panel-body">
<div class="panel panel-default">
<div class="panel-heading">
<div class="panel-title">Info</div>
</div>
<div class="panel-body">
<table class="table table-bordered table-page-break-auto table-layout-fixed">
<tr>
<td class="col-sm-4">ID</td>
<td class="col-sm-8">1</td>
</tr>
</table>
</div>
</div>
</body>
</html>
<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
-->
The requirement is to parse the XML inside the comment in the HTML above. So far, I have tried reading the HTML file into a string and doing the following:
with open('my_html.html', 'rb') as file:
    d = str(file.read())
    d2 = d[d.index('<!--') + 4:d.index('-->')]
    d3 = "'''"+d2+"'''"
This returns the XML data fragment in the string d3, wrapped in three single quotes.
I then tried to read it with ElementTree:
ET.fromstring(d3)
but it fails with the following error:
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 2
Basically, I need some help with this.
First, split your HTML and XML by reading the file line by line and using `if line.startswith(...)` to filter out the comment block:
with open('xmlfile.xml') as fh:
    html, xml = [], []
    for line in fh:
        # check for that comment line
        if line.startswith('<!--'):
            break
        html.append(line)
    # append the current line (the one that opens the comment)
    xml.append(line)
    # keep iterating
    for line in fh:
        # check for ending block comment
        if line.startswith('-->'):
            break
        xml.append(line)
# Get the root tag so a missing closing tag can be added
root_tag = xml[1].strip().strip('<>')
closing_tag = f'</{root_tag}>'
# the sample already closes its root element, so only append if absent
if not any(closing_tag in line for line in xml):
    xml.append(closing_tag)
# join, using the 4: slice to strip off the '<!--' comment opener
xml = ''.join(xml)[4:]
html = ''.join(html)
Now you should be able to parse them independently using the parser of your choice.
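As an illustration of what parsing the XML half might look like, here is a minimal sketch that turns the `mytag` records into a dictionary keyed by `fieldname`. The helper name `fields_to_dict` and the inline `xml_text` string are assumptions made for the example; in practice the string would come from the split step.

```python
import xml.etree.ElementTree as ET

# Stand-in for the XML string extracted from the comment block.
xml_text = """<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>"""


def fields_to_dict(xml_string):
    """Map each <fieldname> to its <val> text (hypothetical helper)."""
    root = ET.fromstring(xml_string)
    return {
        tag.findtext('fieldname'): tag.findtext('val')
        for tag in root.findall('mytag')
    }


fields = fields_to_dict(xml_text)
print(fields)  # {'NAME': 'Testcase', 'AGE': '5'}
```

The CDATA sections come back as ordinary text, so no special handling is needed for them.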
You are already on the right track. I put your HTML in a file and it works fine, as shown below.
import xml.etree.ElementTree as ET

with open('extract_xml.html') as handle:
    content = handle.read()
    xml = content[content.index('<!--')+4: content.index('-->')]
    document = ET.fromstring(xml)
    for element in document.findall("./mytag"):
        for child in element:
            print(child, child.text)
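A likely reason the attempt in the question fails while this one works is that `str()` applied to a `bytes` object returns its printable representation (a `b'...'` literal with escaped newlines) rather than decoded text. The sketch below contrasts the two on a small stand-in document; `raw`, `between_comment`, and `parses` are names invented for the illustration, and `cp1252` is assumed from the encoding declared in the question's XML.

```python
import xml.etree.ElementTree as ET

# Stand-in for the bytes returned by open(path, 'rb').read().
raw = (b'<html></html>\n'
       b'<!--<?xml version="1.0"?>\n'
       b'<ROOTTAG><val>5</val></ROOTTAG>\n'
       b'-->\n')


def between_comment(text):
    """Return the text between the first '<!--' and the first '-->'."""
    return text[text.index('<!--') + 4:text.index('-->')]


def parses(text):
    """Report whether ElementTree accepts the string."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


bad = between_comment(str(raw))              # repr: newlines become backslash + 'n'
good = between_comment(raw.decode('cp1252')) # real text with real newlines

print(parses(bad), parses(good))
```

Opening the file in text mode, as the answer does, avoids the problem in the same way as the explicit `decode` call.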
If you read the file one line at a time, you will find this easier to manage.
import xml.etree.ElementTree as ET

START_COMMENT = '<!--'
END_COMMENT = '-->'

def getxml(filename):
    with open(filename) as data:
        lines = []
        inxml = False
        for line in data.readlines():
            if inxml:
                if line.startswith(END_COMMENT):
                    inxml = False
                else:
                    lines.append(line)
            elif line.startswith(START_COMMENT):
                # the opening line (which also holds the XML declaration) is skipped
                inxml = True
        return ''.join(lines)

ET.fromstring(xml := getxml('/Volumes/G-Drive/foo.html'))
print(xml)
Output:
<ROOTTAG>
<mytag>
<headername>BASE</headername>
<fieldname>NAME</fieldname>
<val><![CDATA[Testcase]]></val>
</mytag>
<mytag>
<headername>BASE</headername>
<fieldname>AGE</fieldname>
<val><![CDATA[5]]></val>
</mytag>
</ROOTTAG>
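The line-based version drops the line carrying the XML declaration, so the declared Windows-1252 encoding plays no part in decoding. If non-ASCII values matter, one option is to slice the comment out of the raw bytes and hand those bytes to `ET.fromstring`, which then reads the encoding from the declaration itself. This is only a sketch: `xml_bytes_from_comment` and the `sample` bytes are made up for the illustration.

```python
import xml.etree.ElementTree as ET


def xml_bytes_from_comment(raw):
    """Return the bytes between the first '<!--' and the following '-->'."""
    start = raw.index(b'<!--') + 4
    end = raw.index(b'-->', start)
    return raw[start:end]


# Hypothetical stand-in for open(path, 'rb').read();
# the byte \xe9 is 'é' in Windows-1252.
sample = (b'<html></html>\n'
          b'<!--<?xml version = "1.0" encoding="Windows-1252" standalone="yes"?>\n'
          b'<ROOTTAG><mytag><fieldname>NAME</fieldname>'
          b'<val><![CDATA[caf\xe9]]></val></mytag></ROOTTAG>\n'
          b'-->\n')

root = ET.fromstring(xml_bytes_from_comment(sample))
name_value = root.find('./mytag/val').text
print(name_value)
```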
With the built-in html.parser module, you can get the XML comment as a string, which you can then parse with xml.etree.ElementTree:
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

class MyHTMLParser(HTMLParser):
    def handle_comment(self, data):
        xml_str = data
        tree = ET.fromstring(xml_str)
        for elem in tree.iter():
            print(elem.tag, elem.text)

parser = MyHTMLParser()

with open("your.html", "r") as f:
    lines = f.readlines()
    for line in lines:
        parser.feed(line)
Output:
ROOTTAG
mytag
headername BASE
fieldname NAME
val Testcase
mytag
headername BASE
fieldname AGE
val 5
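If the page may contain other comments as well, a variation on the same idea is to collect the comments first and parse only those that look like XML. The following is a rough sketch under that assumption; `CommentCollector`, `xml_roots_from_html`, and `sample_html` are hypothetical names, and the `<?xml` prefix check is just one possible way to tell the comments apart.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class CommentCollector(HTMLParser):
    """Collects the text of every HTML comment it encounters."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


def xml_roots_from_html(html_text):
    """Parse each comment that starts with an XML declaration."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    roots = []
    for comment in collector.comments:
        text = comment.strip()
        if text.startswith('<?xml'):
            roots.append(ET.fromstring(text))
    return roots


sample_html = (
    '<html><body><p>Report</p><!-- ordinary note --></body></html>\n'
    '<!--<?xml version="1.0"?>\n'
    '<ROOTTAG><mytag><fieldname>AGE</fieldname>'
    '<val><![CDATA[5]]></val></mytag></ROOTTAG>\n'
    '-->\n'
)
roots = xml_roots_from_html(sample_html)
ages = [root.findtext('./mytag/val') for root in roots]
print(ages)
```

Keeping the XML parsing outside `handle_comment` means an ordinary comment cannot raise a `ParseError` in the middle of feeding the document.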