I want to get all the text between two tags:
<div class="lead">I DONT WANT this</div>
#many different tags - p, table, h2 including text that I want
<div class="image">...</div>
I started like this:
import urllib.request
from bs4 import BeautifulSoup

url = "http://......."
req = urllib.request.Request(url)
source = urllib.request.urlopen(req)
soup = BeautifulSoup(source, 'lxml')
start = soup.find('div', {'class': 'lead'})
end = soup.find('div', {'class': 'image'})
I don't know what to do next.
Try the following code:
from bs4 import BeautifulSoup

soup = BeautifulSoup("""
<html><div class="lead">lead</div>data<div class="end"></div></html>
""", "lxml")

node = soup.find('div', {'class': 'lead'})
s = []
while node is not None:
    node = node.next_sibling
    # Stop at the end marker; text nodes have no attrs, and tags may lack a class.
    if hasattr(node, "attrs") and "end" in node.attrs.get('class', []):
        break
    if node is not None:
        s.append(node)
print(s)
Use next_sibling to step through the sibling nodes one at a time.
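The same sibling walk can be wrapped in a small helper. This is only a sketch: the function name `siblings_until` and the sample markup are made up for illustration, it uses the built-in "html.parser" so it runs without lxml, and it assumes the start and end tags share the same parent.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def siblings_until(start, stop_class):
    """Collect the siblings that follow `start`, stopping at a tag with `stop_class`.

    Hypothetical helper, not part of Beautiful Soup.
    """
    collected = []
    for node in start.next_siblings:
        # Only Tag objects carry attributes; bare text is a NavigableString.
        if isinstance(node, Tag) and stop_class in node.get("class", []):
            break
        collected.append(node)
    return collected


# Made-up sample markup mirroring the structure in the question.
sample = (
    '<div><div class="lead">lead</div>data<p>more</p>'
    '<div class="image">img</div>tail</div>'
)
soup = BeautifulSoup(sample, "html.parser")
nodes = siblings_until(soup.find("div", class_="lead"), "image")
print([str(n) for n in nodes])
```

Iterating over next_siblings avoids the manual None checks that the while loop needs.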
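Try this code. It starts at the div with class lead, stops when it reaches the div with class image, and prints every tag in between; you can change it to print the whole markup instead: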
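html = ""
end = soup.find("div", {"class": "image"})  # look up the end marker once
for tag in soup.find("div", {"class": "lead"}).next_siblings:
    # Compare by identity so a look-alike div elsewhere cannot stop the loop early.
    if tag is end:
        break
    html += str(tag)
print(html)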
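Since the question asks for the text rather than the markup, here is a rough sketch that collects only the text between the two divs. The helper name `text_between` and the sample HTML are assumptions for illustration, and it only works when both divs are siblings under the same parent. It uses "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, Tag


def text_between(soup, start_class, end_class):
    """Return the text found between two sibling divs (hypothetical helper)."""
    start = soup.find("div", class_=start_class)
    end = soup.find("div", class_=end_class)
    if start is None:
        return ""
    parts = []
    for node in start.next_siblings:
        if node is end:
            break
        if isinstance(node, Comment):
            continue  # skip HTML comments
        if isinstance(node, Tag):
            # get_text flattens nested tags such as <b> inside <p>.
            parts.append(node.get_text(" ", strip=True))
        else:
            parts.append(str(node).strip())
    return " ".join(p for p in parts if p)


# Made-up sample markup mirroring the structure in the question.
sample = (
    '<body><div class="lead">skip me</div><h2>Title</h2>'
    '<p>First <b>bold</b> text</p>loose text'
    '<div class="image">...</div><p>after</p></body>'
)
soup = BeautifulSoup(sample, "html.parser")
result = text_between(soup, "lead", "image")
print(result)
```

If the end div is missing, the loop simply runs to the last sibling, so you may want to check for that case before trusting the result.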