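I have a document that looks like this:
HEADER1
HEADER2
President Biden talks on Sunday.
Hello World
How can I help you?
HEADER3
I converted it to HTML, and it looks like this: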
<html>
<body>
<h1>HEADER1</h1>
<ul>
<li>
the virus killed 56
</li>
<li>
Global press
<a href="https://www.example.com">
highlight
</a>
hundreds of dogs jumping
<ul>
<li>
A Twitter user
<a href="http://example.com/xad/status/sda">
posts
</a>
photos of cats
</li>
</ul>
</li>
</ul>
<h1>HEADER2</h1>
<ul>
<li>
President Biden talks on Sunday.
</li>
Hello World
<li>
How can I help you?
</li>
</ul>
<h1>HEADER3</h1>
<ul>
<li>
The war in Gaza continues
</li>
<li>
Global press highlights best pizza
<ul>
<li>
A Twitter user posts sushi
</li>
<li>
A Twitter user posts candy
</li>
</ul>
</li>
</ul>
</body>
</html>
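Because of the way the bullets are nested inside lists, and because the lists themselves are list objects, I can't for the life of me figure out how to parse out each individual list item in the outer ul tags.
The end result I want looks something like this: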
header1_posts = [...]
header2_posts = [...]
header3_posts = [...]
I've tried every combination of find, find_all, and find_all_next, iterating over different elements, but nothing has worked.
Thanks.
You can try:
from bs4 import BeautifulSoup
html_text = """\
<html>
<body>
<h1>HEADER1</h1>
<ul>
<li>
the virus killed 56
</li>
<li>
Global press
<a href="https://www.example.com">
highlight
</a>
hundreds of dogs jumping
<ul>
<li>
A Twitter user
<a href="http://example.com/xad/status/sda">
posts
</a>
photos of cats
</li>
</ul>
</li>
</ul>
<h1>HEADER2</h1>
<ul>
<li>
President Biden talks on Sunday.
</li>
Hello World
<li>
How can I help you?
</li>
</ul>
<h1>HEADER3</h1>
<ul>
<li>
The war in Gaza continues
</li>
<li>
Global press highlights best pizza
<ul>
<li>
A Twitter user posts sushi
</li>
<li>
A Twitter user posts candy
</li>
</ul>
</li>
</ul>
</body>
</html>"""
soup = BeautifulSoup(html_text, "html.parser")
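def get_li_without_ul(li):
    # Re-parse the <li> so the original tree is not modified
    li_copy = BeautifulSoup(str(li), "html.parser")
    # Drop nested lists so only this item's own content remains
    for ul in li_copy.find_all("ul"):
        ul.extract()
    return li_copy

out = {}
for li in soup.find_all("li"):
    # The closest preceding <h1> is the section this item belongs to
    header = li.find_previous("h1")
    out.setdefault(header.text.strip(), []).append(get_li_without_ul(li))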
print(out)
Prints:
{'HEADER1': [<li>
the virus killed 56
</li>, <li>
Global press
<a href="https://www.example.com">
highlight
</a>
hundreds of dogs jumping
</li>, <li>
A Twitter user
<a href="http://example.com/xad/status/sda">
posts
</a>
photos of cats
</li>], 'HEADER2': [<li>
President Biden talks on Sunday.
</li>, <li>
How can I help you?
</li>], 'HEADER3': [<li>
The war in Gaza continues
</li>, <li>
Global press highlights best pizza
</li>, <li>
A Twitter user posts sushi
</li>, <li>
A Twitter user posts candy
</li>]}
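If you would rather end up with plain strings than tag objects, the same idea can be sketched as below. This is only a variant under a few assumptions: the helper names `li_own_text` and `posts_by_header` are made up for illustration, the sample HTML is a shortened version of the one in the question, and `get_text(" ", strip=True)` is used to join the text pieces of each item with single spaces.

```python
from bs4 import BeautifulSoup

# Shortened sample of the HTML from the question (assumption: same structure)
html_text = """<h1>HEADER1</h1>
<ul>
<li>the virus killed 56</li>
<li>Global press <a href="https://www.example.com">highlight</a> hundreds of dogs jumping
<ul>
<li>A Twitter user <a href="http://example.com/xad/status/sda">posts</a> photos of cats</li>
</ul>
</li>
</ul>
<h1>HEADER2</h1>
<ul>
<li>President Biden talks on Sunday.</li>
</ul>"""


def li_own_text(li):
    """Return the text of one <li>, ignoring any nested <ul> (hypothetical helper)."""
    # Work on a re-parsed copy so the original tree stays intact
    li_copy = BeautifulSoup(str(li), "html.parser")
    for ul in li_copy.find_all("ul"):
        ul.extract()
    # Strip each text fragment and join the non-empty ones with a space
    return li_copy.get_text(" ", strip=True)


def posts_by_header(html):
    """Group the text of every <li> under the nearest preceding <h1> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    out = {}
    for li in soup.find_all("li"):
        header = li.find_previous("h1")
        if header is None:
            # Skip items that appear before any header
            continue
        out.setdefault(header.get_text(strip=True), []).append(li_own_text(li))
    return out


posts = posts_by_header(html_text)
print(posts)
```

From there, `header1_posts = posts["HEADER1"]` gives you the list for the first header. Note that nested items are flattened into their header's list, the same as in the output above.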