I have some HTML that I am parsing in Python with the BeautifulSoup package. Here is the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am using this block of code to capture the results:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both the 'x' and the 'x c' divs end up in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
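To illustrate why the attribute selector helps, here is a minimal sketch that reuses the HTML from the question. It assumes the built-in 'html.parser' and a bs4 version that ships with CSS selector support; the variable names by_token and exact are only for illustration. find_all treats class as a multi-valued attribute and matches any div that has "x" among its classes, while div[class="x"] compares against the whole attribute value.

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matches every div that has "x" as one of its classes, including "x c".
by_token = soup.find_all("div", class_="x")

# Matches only divs whose class attribute is exactly "x".
exact = soup.select('div[class="x"]')

by_token_text = [d.get_text() for d in by_token]
exact_text = [d.get_text() for d in exact]

print(by_token_text)  # ['Address', 'Phone', 'Other']
print(exact_text)     # ['Address', 'Phone']
```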
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# find_all still returns all three divs that carry class "x"; index 1 picks the second one.
contact = soup.find_all("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
How about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
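others will be {<div class="x c">Other</div>}
and contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this only works for this particular combination of classes. In general it may not work, depending on which class combinations appear in the HTML. A set also does not preserve document order.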
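If the class combinations are not known in advance, one possible alternative is to pass a function to find_all and compare the tag's full class list. This is only a sketch: has_exact_classes is a hypothetical helper written for this example, not part of BeautifulSoup, and it is hard-coded to div tags to match the question.

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, "html.parser")


def has_exact_classes(*wanted):
    """Hypothetical helper: build a filter for divs whose classes are exactly `wanted`."""
    wanted_set = set(wanted)

    def matcher(tag):
        # tag.get("class") is a list of class names; default to [] when the attribute is absent.
        return tag.name == "div" and set(tag.get("class", [])) == wanted_set

    return matcher


contacts = soup.find_all(has_exact_classes("x"))
others = soup.find_all(has_exact_classes("x", "c"))

contact_text = [d.get_text() for d in contacts]
other_text = [d.get_text() for d in others]

print(contact_text)  # ['Address', 'Phone']
print(other_text)    # ['Other']
```

Unlike the set difference, this returns ordinary result lists in document order, and the order of class names inside the attribute does not matter.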
For more details on how .find_all() works, see "BeautifulSoup webscraping find_all( ): finding exact match".