Python: Remove everything between two given HTML tags


I am working in Python. Here is my HTML:

<a href="https://www.example.com">Original message</a><br>
<ul id="list">
    <li class="blockbody" id="post_1">
        <div class="header">
            <div class="datetime">
                24 januari 2020, 11:34
            </div><span class="name">Jane Doe</span>
        </div>
        <div class="content">
            <blockquote class="restore">
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        <div>
                            Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                        </div>
                        <div class="message">
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                        </div>
                        <hr>
                    </div>
                </div><br>
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                        <br>
                        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                        <br>
                        quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                        <hr>
                    </div>
                </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                <br>
                velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
            </blockquote>
        </div>
    </li>
</ul>

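I want to delete everything between THE FIRST

<div class="bbcode_quote printable">

and THE LAST

<hr>

tag. As you can see, both tags occur more than once, which is why I stress "first" and "last". I am familiar with Python, but string manipulation is not my strong suit. I hope I have made myself clear.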

python html text tags
1 Answer

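You can use the Beautiful Soup library (https://pypi.org/project/beautifulsoup4/), which parses the HTML into a tree so that you can locate and remove nodes instead of slicing strings.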

For example, to clear everything between the first and the last <hr> of the first quote block:


from bs4 import BeautifulSoup
# The HTML string provided
html = """
<a href="https://www.example.com">Original message</a><br>
<ul id="list">
    <li class="blockbody" id="post_1">
        <div class="header">
            <div class="datetime">
                24 januari 2020, 11:34
            </div><span class="name">Jane Doe</span>
        </div>
        <div class="content">
            <blockquote class="restore">
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        <div>
                            Citation: John Doe <a href="showthread.php#post684209" rel="nofollow"><img alt="" class="inlineimg" src="image/style/Aesthetica/button/view.gif"></a>
                        </div>
                        <div class="message">
                            Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
                        </div>
                        <hr>
                    </div>
                </div><br>
                <div class="bbcode_container">
                    <i class="fa fa-envelope"></i> Citation:
                    <div class="bbcode_quote printable">
                        <hr>
                        Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?<br>
                        <br>
                        Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam.<br>
                        <br>
                        quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum..
                        <hr>
                    </div>
                </div>... <a href="https://example.com" target="_blank">https://example.html</a><br>
                <br>
                velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum?
            </blockquote>
        </div>
    </li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find the first div with class "bbcode_quote printable"
first_quote = soup.find('div', class_='bbcode_quote printable')

# The first and last <hr> tags inside that div mark the range to clear
start = first_quote.hr
end = first_quote.find_all('hr')[-1]

# Remove every node (tags and text) between the two <hr> tags.
# Compare with "is not": Tag equality compares markup, so any two
# bare <hr> tags would count as equal under "!=".
current = start.next_sibling
while current is not None and current is not end:
    current.extract()
    current = start.next_sibling

# The loop keeps both <hr> tags. To remove them as well, uncomment:
# start.extract()
# end.extract()

# Print the modified HTML
print(str(soup))
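
The code above works inside the first quote block only. If the question is read literally (cut from the first opening `<div class="bbcode_quote printable">` to the last `<hr>` anywhere in the document), plain string slicing may be enough. The sketch below is only an illustration under that reading: `remove_between` and the shortened `sample` string are hypothetical names and data, not part of the original post, and the approach assumes both tags appear in the text exactly as spelled.

```python
def remove_between(html: str, start_marker: str, end_marker: str) -> str:
    """Drop the text between the first start_marker and the last end_marker.

    Both markers are kept. The input is returned unchanged when either
    marker is missing or the last end_marker does not come after the
    first start_marker.
    """
    start = html.find(start_marker)
    if start == -1:
        return html
    cut_from = start + len(start_marker)
    cut_to = html.rfind(end_marker)
    if cut_to < cut_from:  # also covers cut_to == -1 (marker not found)
        return html
    return html[:cut_from] + html[cut_to:]


# Shortened, hypothetical stand-in for the HTML in the question
sample = (
    '<div class="bbcode_quote printable"><hr>first quote<hr></div>'
    '<br>'
    '<div class="bbcode_quote printable"><hr>second quote<hr></div>'
    '<p>kept</p>'
)
cleaned = remove_between(sample, '<div class="bbcode_quote printable">', '<hr>')
print(cleaned)
# prints: <div class="bbcode_quote printable"><hr></div><p>kept</p>
```

Slicing like this leaves balanced markup only when the removed span contains as many opening as closing tags, which happens to hold for the structure in the question. For less regular input, the tree-based approach is likely the safer choice.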