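I have a lot of HTML pages that somehow have multiple newlines embedded in them: tags sit on separate lines, and some sentences are broken at seemingly random points. Here is an example of what I'm working with: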
<html>
<head>
<title>One of many</title>
</head>
<body>
<h1>
Spam is not ham
</h1>
<p>
Many plates of Spam
</p>
<p>
Use the Fry option to properly cook the
Spam
until done.
</p>
<p>
Enquiries for more recipes can be made through the
Feed Me
option.
</p>
</body>
</html>
I used the str.replace() method and had partial success with the opening tags, using the following code:
html_filename = 'page.htm'
f = open(html_filename, encoding="utf-8")
file_str = f.readlines()
f.close()
with open(html_filename, 'w', encoding="utf-8") as f:
    for line in file_str:
        if '<h1>\n' in line:
            tmp = line.replace('<h1>\n', '<h1>')
            f.write(tmp)
        elif '<p>\n' in line:
            tmp = line.replace('<p>\n', '<p>')
            f.write(tmp)
        else:
            f.write(line)
and got the following result:
<html>
<head>
<title>One of many</title>
</head>
<body>
<h1>Spam is not ham
</h1>
<p>Many plates of Spam
</p>
<p>Use the Fry option to properly cook the
Spam
until done.
</p>
<p>Enquiries for more recipes can be made through the
Feed Me
option.
</p>
</body>
</html>
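However, I can't figure out how to handle the lines that contain only text, or the lines that contain only a closing tag.
The desired output isn't stated. It looks like the main challenge is merging the lines that don't contain any XML/HTML tags.
You can use the standard, built-in xml.dom.minidom module to parse and pretty-print your data.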
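import xml.dom.minidom as MD

with open("foo.html", "r+", -1, "utf-8") as data:
    lines = []
    for line in map(str.strip, data):
        if not line:
            continue  # skip blank lines
        # Keep tag lines, and the first text line after a tag, as separate items.
        # "not lines" guards against an IndexError when the file starts with text.
        if not lines or line.startswith("<") or lines[-1].startswith("<"):
            lines.append(line)
        else:
            # Continuation of a text run: join it to the previous text line.
            lines[-1] += " " + line
    pretty = MD.parseString("".join(lines)).toprettyxml(" ")
    # Skip the XML declaration that toprettyxml() puts in front of <html>.
    start = pretty.index("<html>")
    # Overwrite the file in place, then cut off any leftover old content.
    data.seek(0)
    print(pretty[start:], file=data, end="")
    data.truncate()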
Output:
<html>
 <head>
  <title>One of many</title>
 </head>
 <body>
  <h1>Spam is not ham</h1>
  <p>Many plates of Spam</p>
  <p>Use the Fry option to properly cook the Spam until done.</p>
  <p>Enquiries for more recipes can be made through the Feed Me option.</p>
 </body>
</html>
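If you want to try the merging step without touching a file, the same idea can be sketched as two small functions that work on a string. The names merge_text_lines and tidy and the shortened sample page are made up for this illustration; the logic follows the snippet above, with blank lines skipped and a guard for input that starts with a text line.

```python
import xml.dom.minidom as MD


def merge_text_lines(html_text):
    """Return stripped lines, with each run of text-only lines joined into one."""
    lines = []
    for line in map(str.strip, html_text.splitlines()):
        if not line:
            continue  # ignore blank lines
        # Tag lines and the first text line after a tag start a new item.
        if not lines or line.startswith("<") or lines[-1].startswith("<"):
            lines.append(line)
        else:
            # Otherwise this is a continuation of the previous text line.
            lines[-1] += " " + line
    return lines


def tidy(html_text, indent=" "):
    """Pretty-print the merged markup, dropping the XML declaration."""
    pretty = MD.parseString("".join(merge_text_lines(html_text))).toprettyxml(indent)
    return pretty[pretty.index("<html>"):]


# Hypothetical, shortened version of the page from the question.
sample = """<html>
<body>
<p>
Use the Fry option to properly cook the
Spam
until done.
</p>
</body>
</html>"""

print(tidy(sample))
```

Keep in mind that minidom is an XML parser, so this only works when the page is well-formed XML, as the sample is. A page with unclosed tags such as a bare <br> makes parseString() raise an xml.parsers.expat.ExpatError, and in that case an HTML-aware parser would be the safer choice.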