I want the contents of each div with class 'hide info-json' whose parent li tag has class 'info-wrap' or 'info-wrap no-meta', but not 'info-wrap hide'.
HTML example:
<li class="info-wrap">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>
<li class="info-wrap hide">
<div class="hide info-json">
<p>Content That I Don't Want </p>
</div>
</li>
<li class="info-wrap no-meta">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>
This is my code:
soup = BeautifulSoup(res.text, "lxml")
for divTags in soup.findAll('li', class_ = re.compile('^(?!.*hide).*info-wrap.*$')):
    for infoList in divTags.find_all('div',{'class':'hide info-json'}):
        Curinfo = json.loads(infoList.text)
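But it returns nothing.
When I test this regex at https://regex101.com/r/8yJ5yI/1 it works fine. Please help me figure out how to do this.
Using a regex is not mandatory for me; I only need <p>Content That I Want </p>
Thanks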
import re
html = """<li class="info-wrap">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>
<li class="info-wrap hide">
<div class="hide info-json">
<p>Content That I Don't Want </p>
</div>
</li>
<li class="info-wrap no-meta">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>"""
l = re.findall(r"""<li\s+class="info-wrap(\s+no-meta)?"\s*>\s*
<div\s+class="hide\s+info-json"\s*>
\s*(.*?)\s*
</div>\s*
</li>
""",html, flags=re.VERBOSE|re.IGNORECASE|re.DOTALL)
l = [item[1] for item in l]
print(l)
Use :not (bs4 4.7.1+) to filter out the unwanted class
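A minimal sketch of the `:not` approach, assuming BeautifulSoup 4.7.1 or later is installed (its `select()` relies on the soupsieve package for CSS selectors). The variable names `html` and `wanted` are illustrative, and `html.parser` is used here only to avoid depending on lxml; the question's `res.text` would take the place of `html`.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """<li class="info-wrap">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>
<li class="info-wrap hide">
<div class="hide info-json">
<p>Content That I Don't Want </p>
</div>
</li>
<li class="info-wrap no-meta">
<div class="hide info-json">
<p>Content That I Want - JSON Data </p>
</div>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# li.info-wrap:not(.hide) keeps every li that has the class info-wrap
# but does not also have the class hide; the rest of the selector then
# picks the div children carrying both the hide and info-json classes.
selector = "li.info-wrap:not(.hide) > div.hide.info-json"
wanted = [div.get_text(strip=True) for div in soup.select(selector)]

print(wanted)
```

With real data, `json.loads(div.get_text())` could replace `get_text(strip=True)` inside the list comprehension. The selector works on individual class names, which avoids one pitfall of the regex attempt: as far as the BeautifulSoup documentation describes it, `class_` is matched against each of a tag's classes separately, so a negative lookahead for `hide` cannot rule out a tag whose `info-wrap` class matches on its own.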