Replace or remove HTML tags and their content with a Python regex

Question

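I want to use a regular expression to remove an HTML element: its opening tag, its closing tag, and everything between them. How can I remove the <head> tag from the following string?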

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

So that it looks like this:

my_string = '''
<html>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

Answer 1

You can use Beautiful Soup's decompose() function to remove the head tag from the HTML text in Python. Try this Python code:

from bs4 import BeautifulSoup

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

soup = BeautifulSoup(my_string, 'html.parser')  # name the parser explicitly so bs4 does not warn and pick one for you
soup.find('head').decompose()  # find head tag and decompose/destroy it from the html
print(soup)                    # prints html text without head tag

This prints the following (the exact whitespace may vary with the parser):

<html>

<meta/>
<p>
        this is a different paragraph tag
        </p>
</html>

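Using a regex here is not recommended, but if the tag you want to remove is not nested, a regex like the one in the following Python code can remove it. As a rule, avoid regular expressions for parsing nested structures and use a proper parser instead.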

import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

print(re.sub(r'(?s)<head>.*?</head>', '', my_string))

This prints the following. Note the (?s) flag: it is needed to make the dot match newlines, because your HTML spans multiple lines.

<html>

    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
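
The regex above only matches a bare, lowercase <head> with no attributes. As a rough sketch of how it could be made a little more tolerant, the hypothetical helper below (remove_tag is an illustrative name, not part of the original answer or of any library) uses re.DOTALL in place of the inline (?s) flag, ignores case, and allows attributes on the opening tag. The second call shows why this approach should still be limited to tags that are not nested.

```python
import re


def remove_tag(html, tag):
    """Remove each <tag ...>...</tag> block and its content.

    Hypothetical helper for illustration only. It assumes the tag is
    never nested inside itself; for nested markup use a real parser.
    """
    name = re.escape(tag)
    pattern = re.compile(
        rf'<{name}\b[^>]*>.*?</{name}\s*>',
        flags=re.DOTALL | re.IGNORECASE,  # re.DOTALL is the same as (?s)
    )
    return pattern.sub('', html)


# Uppercase tag name, an attribute, and content spread over several lines.
sample = '<html><HEAD lang="en">\n<p>x</p>\n</head><meta></meta></html>'
cleaned = remove_tag(sample, 'head')
print(cleaned)  # <html><meta></meta></html>

# Limitation: the lazy match stops at the first closing tag, so a tag
# nested inside itself leaves debris behind.
nested = '<div>a<div>b</div>c</div>'
leftover = remove_tag(nested, 'div')
print(leftover)  # c</div>
```

The leftover "c</div>" in the nested case is the kind of breakage the answer warns about, which is why a parser such as Beautiful Soup remains the safer choice.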
