How can I use Python to clean up HTML code that contains many unwanted line breaks?


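I have a lot of HTML pages that have somehow ended up with multiple embedded line breaks: tags sit on separate lines, and some sentences are split at seemingly random points. Here is an example of what I'm working with: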

<html>
    <head>
        <title>One of many</title>
    </head>
<body>
    <h1>
Spam is not ham
</h1>
    <p>
Many plates of Spam
</p>
    <p>
Use the Fry option to properly cook the
Spam
until done.
</p>
    <p>
Enquiries for more recipes can be made through the
Feed Me
option.
</p>
  </body>
</html>

I used the str.replace() method and had partial success with the opening tags, using the following code:

html_filename = 'page.htm'

f = open(html_filename, encoding="utf-8")
file_str = f.readlines()
f.close()

with open(html_filename, 'w', encoding="utf-8") as f:
    for line in file_str:
        if '<h1>\n' in line:
            tmp = line.replace('<h1>\n', '<h1>')
            f.write(tmp)
        elif '<p>\n' in line:
            tmp = line.replace('<p>\n', '<p>')
            f.write(tmp)
        else:
            f.write(line)

and got the following result:

<html>
    <head>
        <title>One of many</title>
    </head>
<body>
    <h1>Spam is not ham
</h1>
    <p>Many plates of Spam
</p>
    <p>Use the Fry option to properly cook the
Spam
until done.
</p>
    <p>Enquiries for more recipes can be made through the
Feed Me
option.
</p>
  </body>
</html>

However, I can't work out how to handle lines that contain only text or lines that contain only a closing tag.

python html
1 Answer

The desired output isn't stated. It looks like the main challenge is merging the lines that don't contain any XML/HTML tags.
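
To make that merging rule concrete, here is a minimal sketch of it in isolation. It assumes, as in the sample page, that every tag sits at the start of its own line. The helper name merge_text_lines is hypothetical and not part of any library.

```python
def merge_text_lines(raw_lines):
    """Fold runs of consecutive tag-free lines into a single line.

    merge_text_lines is a hypothetical helper name used for illustration.
    """
    merged = []
    for line in map(str.strip, raw_lines):
        if not line:
            continue  # ignore blank lines
        previous_is_text = bool(merged) and not merged[-1].startswith("<")
        if line.startswith("<") or not previous_is_text:
            merged.append(line)        # a tag line, or the first text line after a tag
        else:
            merged[-1] += " " + line   # continuation of the previous text line
    return merged


sample = [
    "    <p>\n",
    "Use the Fry option to properly cook the\n",
    "Spam\n",
    "until done.\n",
    "</p>\n",
]
print(merge_text_lines(sample))
```

For the sample above this prints ['<p>', 'Use the Fry option to properly cook the Spam until done.', '</p>']. Joining the resulting list with an empty string then puts the text directly between its opening and closing tags.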

You can use the built-in xml.dom.minidom module from the standard library to parse and pretty-print your data.

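import xml.dom.minidom as MD

# Open for reading and writing so the file can be rewritten in place.
with open("foo.html", "r+", encoding="utf-8") as data:
    lines = []
    for line in map(str.strip, data):
        if not line:
            continue  # skip blank lines
        # Keep tag lines, and the first text line after a tag line, as separate entries.
        if not lines or line.startswith("<") or lines[-1].startswith("<"):
            lines.append(line)
        else:
            # Merge consecutive text lines, separated by a single space.
            lines[-1] += " " + line
    # minidom is an XML parser, so the joined markup must be well-formed XML.
    pretty = MD.parseString("".join(lines)).toprettyxml("  ")
    # Skip the XML declaration that toprettyxml() puts before <html>.
    start = pretty.index("<html>")
    data.seek(0)
    print(pretty[start:], file=data, end="")
    data.truncate()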

Output:

<html>
  <head>
    <title>One of many</title>
  </head>
  <body>
    <h1>Spam is not ham</h1>
    <p>Many plates of Spam</p>
    <p>Use the Fry option to properly cook the Spam until done.</p>
    <p>Enquiries for more recipes can be made through the Feed Me option.</p>
  </body>
</html>
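
One caveat: xml.dom.minidom is an XML parser, so it only accepts well-formed markup. A page containing, for example, an unclosed <br> makes parseString() raise xml.parsers.expat.ExpatError. The sketch below shows one possible way to detect that case before overwriting a file. The helper name prettify_or_none is hypothetical, and returning None on failure is just one design choice among several.

```python
import xml.dom.minidom as MD
from xml.parsers.expat import ExpatError


def prettify_or_none(markup):
    """Return pretty-printed markup, or None if it is not well-formed XML.

    prettify_or_none is a hypothetical helper name used for illustration.
    """
    try:
        document = MD.parseString(markup)
    except ExpatError:
        return None
    # Serializing the root element (not the document) omits the XML declaration.
    return document.documentElement.toprettyxml("  ")


print(prettify_or_none("<p>Many plates of Spam</p>"))  # well-formed
print(prettify_or_none("<p>Spam<br>ham</p>"))          # unclosed <br>, prints None
```

If some of your pages fall into the second category, they would need a real HTML parser rather than an XML one.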