BeautifulSoup find_all() captures too much text

Question   Votes: 1   Answers: 3

I have some HTML that I am parsing in Python with the BeautifulSoup package. Here is the HTML:

<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>

I am using this block of code to capture the results:

names = soup3.find_all('div', {'class': "n"}) 
contact = soup3.find_all('div', {'class': "x"})  
other = soup3.find_all('div', {'class': "x c"})  

Right now, the divs with class 'x' and the div with class 'x c' are all captured in the 'contact' variable. How can I prevent this?

python html web-scraping beautifulsoup
3 answers
2 votes

Try:

soup.select('div[class="x"]')

Output:

[<div class="x">Address</div>, <div class="x">Phone</div>]
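A minimal sketch of why this works, reusing the HTML from the question. BeautifulSoup treats `class` as a multi-valued attribute, so `find_all` with `class_="x"` matches any div whose class list contains `x`, while the CSS attribute selector compares the whole attribute string. The variable names `loose` and `exact` are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches every div whose class list contains "x", including class="x c".
loose = soup.find_all("div", class_="x")

# Matches only divs whose class attribute is exactly the string "x".
exact = soup.select('div[class="x"]')

print([d.get_text() for d in loose])  # ['Address', 'Phone', 'Other']
print([d.get_text() for d in exact])  # ['Address', 'Phone']
```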

1 vote

from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# class_="x" also matches the 'x c' div; [1] only picks the second match (Phone).
contact = soup.find_all("div", class_="x")[1]

print(contact)

Output:

<div class="x">Phone</div>

0 votes

How about using sets?

others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others

others will be {<div class="x c">Other</div>} and contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}

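Note that this only works for this particular set of classes. In general it may not work, depending on the combinations of classes present in the HTML. Sets are also unordered, so the results lose their document order.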
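One possible way around that limitation is to pass a function to `find_all` and compare the tag's full class list. This is only a sketch: `has_exactly` is a hypothetical helper written for this example, not part of BeautifulSoup, and it treats class order as irrelevant.

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, "html.parser")


def has_exactly(name, *wanted):
    """Hypothetical helper: build a filter for tags named `name`
    whose class set equals `wanted` (order ignored)."""
    wanted_set = set(wanted)

    def matcher(tag):
        # tag.get("class", []) returns a list of class names, or [] if absent.
        return tag.name == name and set(tag.get("class", [])) == wanted_set

    return matcher


# find_all accepts a function and keeps the tags for which it returns True.
contacts = soup.find_all(has_exactly("div", "x"))
others = soup.find_all(has_exactly("div", "x", "c"))

print([d.get_text() for d in contacts])  # ['Address', 'Phone']
print([d.get_text() for d in others])    # ['Other']
```

Unlike the set difference, this keeps the results as lists in document order and does not depend on which other class combinations appear in the page.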

For more details on how .find_all() works, see "BeautifulSoup webscraping find_all( ): finding exact match".
