The code below gets the URL of every gym location on this website, starting with "Albertville, AL".
from urllib.parse import urljoin  # Python 3; on Python 2 this was "from urlparse import urljoin"
import requests
import urllib3
from bs4 import BeautifulSoup

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')
tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']
for link in links:
    # Keep only the gym pages and print them as absolute URLs.
    if any(keyword in link for keyword in keywords):
        print(urljoin('https://www.planetfitness.com/', link))
The first two links in the output:
https://www.planetfitness.com/gyms/albertville-al
https://www.planetfitness.com/gyms/alexander-city-al
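However, I am trying to scrape the following from each of the links in the output:
Below is my attempt at the street address part. However, I don't know how to fix the code so that it actually does this.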
res1 = requests.get(urljoin('https://www.planetfitness.com/', link)).content
soup1 = BeautifulSoup(res1, 'html.parser')
ps = soup.find_all('p', {'class': 'address'})
address1 = [p.find('span')['itemprop'] for p in ps]
Thanks for your help!
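To debug this, start by printing out the values of the tags. You are searching for all <a> tags with the class clubs-list, and no such tags exist. The <a> tags have no class, but their parent <td> has the class club-title.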
You could try something like this.
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')
# The class is on the <td>, so search for that and take the <a> inside it.
tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']
for link in links:
    if any(keyword in link for keyword in keywords):
        print(link)
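To make the debugging advice above concrete, here is a minimal sketch that runs both searches against a hard-coded row instead of the live page. The HTML string is an assumed stand-in for one row of the sitemap table, and the names sample_row, by_a_class, and hrefs are introduced only for illustration.

```python
from bs4 import BeautifulSoup

# ASSUMPTION: a simplified copy of one sitemap row; the real page may differ.
sample_row = """
<table><tr>
  <td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong></a></td>
</tr></table>
"""

soup = BeautifulSoup(sample_row, 'html.parser')

# Searching for <a> tags by class finds nothing, because the <a> has no class.
by_a_class = soup.find_all('a', {'class': 'clubs-list'})
print(len(by_a_class))

# Searching for the parent <td> by its class works; the link is inside it.
tds = soup.find_all('td', {'class': 'club-title'})
hrefs = [td.find('a')['href'] for td in tds]
print(hrefs)
```

Printing the intermediate lists like this shows right away which search comes back empty.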
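This will get every link and address on that page. It looks like, if you want to find more information about each club, you will have to iterate over the links and load each page.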
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')
atags = soup.find_all('td', {'class': 'club-title'})
# Pair each link's href with the street address held in the nested <p>.
links = [(atag.find('a')['href'], atag.find('p').text) for atag in atags]
for link in links:
    print(link)
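The per-club step mentioned above (loading each gym page in turn) could look roughly like the sketch below. It assumes, based only on the selector used in the question, that a gym page marks up its address as <p class="address"> containing <span itemprop="streetAddress">; that markup has not been checked against the live site. The names extract_street_address, scrape_addresses, and the fetch parameter are hypothetical, and fetch is passed in so the parsing can be tried without network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = 'https://www.planetfitness.com/'


def extract_street_address(html):
    """Return the street address from one gym page, or None if not found.

    ASSUMPTION: the address lives in
    <p class="address"><span itemprop="streetAddress">...</span></p>.
    """
    soup = BeautifulSoup(html, 'html.parser')
    span = soup.select_one('p.address span[itemprop="streetAddress"]')
    return span.get_text(strip=True) if span else None


def scrape_addresses(links, fetch):
    """Map each absolute gym URL to its street address.

    `fetch` is any callable that takes a URL and returns that page's HTML.
    """
    results = {}
    for link in links:
        url = urljoin(BASE_URL, link)
        results[url] = extract_street_address(fetch(url))
    return results
```

With requests, fetch could be something like lambda url: requests.get(url).content, and links would be the hrefs collected from the sitemap.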
You want to select the <a> elements inside the <td> elements that have the class club-title, and extract their href attribute.
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')
# The rows in the table of gyms are formatted like so:
# <tr>
#   <td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
#   <td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
# </tr>
# This will find all the links to all the gyms.
atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]
for link in links:
    print(link)
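As a quick check of the selector, the sketch below runs it against the sample row quoted in the comments above rather than the live page, then uses urljoin to turn the relative href into an absolute URL. The names sample_row and absolute are introduced here for illustration only.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# One row copied from the comment above; the live table has many such rows.
sample_row = """
<table><tr>
<td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
<td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
</tr></table>
"""

soup = BeautifulSoup(sample_row, 'html.parser')

# Only the <a> that is a direct child of td.club-title matches;
# the "Join Now" link sits inside td.club-join and is skipped.
atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]

# Resolve the relative hrefs against the site root.
absolute = [urljoin('https://www.planetfitness.com/', link) for link in links]
print(absolute)
```

Because the selector is anchored on td.club-title, the offers link is excluded without a separate keyword filter.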