Python BeautifulSoup regex filter not working


I want the content of each div with class 'hide info-json' whose parent li tag has the class 'info-wrap' or 'info-wrap no-meta', but not 'info-wrap hide'.

Sample HTML:

<li class="info-wrap">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data </p>
    </div>
</li>

<li class="info-wrap hide">
    <div class="hide info-json">
        <p>Content That I Don't Want </p>
    </div>
</li>

<li class="info-wrap no-meta">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data  </p>
    </div>
</li>

Here is my code:

import json
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(res.text, "lxml")  # res: response object obtained earlier (not shown)
for divTags in soup.findAll('li', class_=re.compile('^(?!.*hide).*info-wrap.*$')):
    for infoList in divTags.find_all('div', {'class': 'hide info-json'}):
        Curinfo = json.loads(infoList.text)

But it returns nothing.

When I test this regex at https://regex101.com/r/8yJ5yI/1 it works fine. Please help me with how to do this.

Using a regex is not mandatory for me; I just need <p>Content That I Want </p>

Thanks

python-3.x beautifulsoup findall
2 Answers

0 votes
import re

html = """<li class="info-wrap">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data </p>
    </div>
</li>

<li class="info-wrap hide">
    <div class="hide info-json">
        <p>Content That I Don't Want </p>
    </div>
</li>

<li class="info-wrap no-meta">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data  </p>
    </div>
</li>"""

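# re.VERBOSE ignores whitespace inside the pattern, so the layout below is cosmetic;
# group 1 is the optional " no-meta" class, group 2 is the div content.
matches = re.findall(r"""<li\s+class="info-wrap(\s+no-meta)?"\s*>\s*
               <div\s+class="hide\s+info-json"\s*>
               \s*(.*?)\s*
               </div>\s*
               </li>
               """, html, flags=re.VERBOSE|re.IGNORECASE|re.DOTALL)
# findall returns one tuple per match; keep only the div content (second group).
contents = [item[1] for item in matches]
print(contents)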
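
If you would rather stay inside BeautifulSoup than run a regex over the raw HTML, here is a minimal sketch using a function filter. As far as I can tell, bs4 tests a class_ regex against each class value separately before it tries the full attribute string, so 'info-wrap' on its own passes the negative lookahead even on 'info-wrap hide'; checking the class list directly avoids that. The shortened HTML and the helper name is_visible_info_wrap are made up for illustration, and html.parser is used only so the snippet needs no extra parser.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the HTML in the question (illustrative only).
html = """
<li class="info-wrap"><div class="hide info-json"><p>wanted 1</p></div></li>
<li class="info-wrap hide"><div class="hide info-json"><p>unwanted</p></div></li>
<li class="info-wrap no-meta"><div class="hide info-json"><p>wanted 2</p></div></li>
"""


def is_visible_info_wrap(tag):
    """Return True for <li> tags whose class list has info-wrap but not hide."""
    classes = tag.get("class") or []
    return tag.name == "li" and "info-wrap" in classes and "hide" not in classes


soup = BeautifulSoup(html, "html.parser")

contents = []
for li in soup.find_all(is_visible_info_wrap):
    # class_="info-json" matches any div that carries that class token.
    for div in li.find_all("div", class_="info-json"):
        contents.append(div.get_text(strip=True))

print(contents)
```

If the div text is real JSON, each string in contents can be passed to json.loads afterwards.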

0 votes

Use :not (bs4 4.7.1+) to filter out the unwanted class.
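
The code for this answer did not survive on the page, so the following is only a sketch of what a :not selector could look like, not the original author's snippet. It assumes the HTML from the question and uses select() with a CSS selector; the variable names and the choice of html.parser are mine.

```python
from bs4 import BeautifulSoup

html = """<li class="info-wrap">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data </p>
    </div>
</li>

<li class="info-wrap hide">
    <div class="hide info-json">
        <p>Content That I Don't Want </p>
    </div>
</li>

<li class="info-wrap no-meta">
    <div class="hide info-json">
        <p>Content That I Want - JSON Data  </p>
    </div>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# li.info-wrap:not(.hide) keeps <li> elements that have the info-wrap class
# but do not have the hide class; "> div.hide.info-json" then picks the
# direct child div carrying both of its classes.
selector = "li.info-wrap:not(.hide) > div.hide.info-json"
wanted = [div.get_text(strip=True) for div in soup.select(selector)]

print(wanted)
```

With real data you would call json.loads on each div's text instead of collecting the stripped strings.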
