BeautifulSoup - How do I get all the text between two different tags?

Question  Votes: 5  Answers: 2

I want to get all the text between two tags:

<div class="lead">I DONT WANT this</div>

#many different tags - p, table, h2 including text that I want

<div class="image">...</div>

I started like this:

import urllib.request
from bs4 import BeautifulSoup

url = "http://......."
req = urllib.request.Request(url)
source = urllib.request.urlopen(req)
soup = BeautifulSoup(source, 'lxml')

start = soup.find('div', {'class': 'lead'})
end = soup.find('div', {'class': 'image'})

I don't know what to do next.

python beautifulsoup
2 answers
0
votes

Try the following code:

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("""
    <html><div class="lead">lead</div>data<div class="end"></div></html>
    """, "lxml")

node = soup.find('div', {'class': 'lead'})
s = []
while node is not None:
    # Step to the next sibling, which is either a Tag or a text node.
    node = node.next_sibling
    if node is None:
        break
    # Stop at the end marker; .get() avoids a KeyError on tags without a class.
    if isinstance(node, Tag) and "end" in node.get("class", []):
        break
    s.append(node)
print(s)

Use next_sibling to step through the sibling nodes that follow the starting tag.
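The loop above collects nodes, while the question asks for text. A minimal sketch of turning the same sibling walk into a list of strings follows. The sample markup and the helper name text_between_siblings are made up for illustration, and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup standing in for the real page.
SAMPLE = """
<div class="lead">I DONT WANT this</div>
<p>First paragraph</p>
<h2>Heading</h2>
<div class="image">...</div>
"""


def text_between_siblings(soup, start_class, end_class):
    """Collect text from the siblings after the start div, stopping at the end div."""
    start = soup.find("div", class_=start_class)
    if start is None:
        return []
    parts = []
    for node in start.next_siblings:
        if isinstance(node, Tag):
            # Stop as soon as the end marker is reached.
            if end_class in node.get("class", []):
                break
            text = node.get_text(" ", strip=True)
        else:
            # Bare text node sitting between tags.
            text = str(node).strip()
        if text:
            parts.append(text)
    return parts


soup = BeautifulSoup(SAMPLE, "html.parser")
print(text_between_siblings(soup, "lead", "image"))
```

This only works when both div elements share the same parent, because next_siblings never leaves the starting tag's level of the tree.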


0
votes

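Try this code. It starts at the div with class lead, walks through the siblings that follow it, and stops when it reaches the div with class image, printing every tag in between. It can be changed to print the whole markup instead: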

html = ""
end = soup.find("div", {"class": "image"})
for tag in soup.find("div", {"class": "lead"}).next_siblings:
    # Stop once the closing <div class="image"> is reached.
    if tag is end:
        break
    html += str(tag)
print(html)
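Both answers assume the two div elements are siblings. If the end marker might be nested at a different depth, one possible alternative is to walk next_elements in document order instead. The sketch below is only an illustration under that assumption: the markup and the helper name text_between are hypothetical, and html.parser is chosen to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup where the end marker is nested deeper than the start.
NESTED = """
<div class="lead">skip me</div>
<section><p>wanted <b>text</b></p>
<div class="image">stop here</div><p>after</p></section>
"""


def text_between(start, end):
    """Return the text strings that come after `start` and before `end`."""
    # next_elements begins with the contents of `start` itself, so skip those.
    inside_start = {id(d) for d in start.descendants}
    parts = []
    for element in start.next_elements:
        if element is end:
            break
        if isinstance(element, NavigableString) and id(element) not in inside_start:
            text = element.strip()
            if text:
                parts.append(text)
    return parts


soup = BeautifulSoup(NESTED, "html.parser")
result = text_between(
    soup.find("div", class_="lead"),
    soup.find("div", class_="image"),
)
print(result)
```

Because every text node is visited separately, inline tags such as b yield separate list entries; join them with " ".join(result) if a single string is wanted.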