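Given a list of links in Python, how can I scrape the street address from each link?

Question  Votes: -1  Answers: 3

The code below gets the URL of every gym location on this website, starting with "Albertville, AL".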

from urllib.parse import urljoin  # Python 3 location of urljoin
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

# Each gym is listed in a <td class="club-title"> cell that wraps a link.
tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']

for link in links:
    if any(keyword in link for keyword in keywords):
        print(urljoin('https://www.planetfitness.com/', link))

The first two links it prints:

https://www.planetfitness.com/gyms/albertville-al
https://www.planetfitness.com/gyms/alexander-city-al

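However, I am trying to scrape the following from each of those links:

  • Street address
  • Club hours

Below is my attempt at the street address part. I don't know how to fix the code so that it actually does this.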

res1 = requests.get(urljoin('https://www.planetfitness.com/', link)).content
soup1 = BeautifulSoup(res1, 'html.parser')

ps = soup.find_all('p', {'class': 'address'})
address1 = [p.find('span')['itemprop'] for p in ps]

Thanks for your help!

python web-scraping beautifulsoup python-requests href
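3 Answers
0
votes

To debug, start by printing the values of the tags. You are searching for all a tags with the class clubs-list, and no such tags exist. The a tags have no class, but their parent td has the class club-title.

You could try something like this.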

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']

for link in links:
    if any(keyword in link for keyword in keywords):
        print(link)
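
As a rough illustration of the debugging advice above, the sketch below counts how many tags each lookup matches before anything is indexed. The SAMPLE_ROW string is an assumption about the sitemap's markup, inlined so that no request is needed, and count_matches is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Assumed markup: one row of the sitemap table, inlined for offline testing.
SAMPLE_ROW = (
    '<table><tr>'
    '<td class="club-title"><a href="/gyms/albertville-al">'
    '<strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>'
    '</tr></table>'
)


def count_matches(soup, name, class_name):
    """Hypothetical helper: count the <name> tags that carry class_name."""
    return len(soup.find_all(name, {'class': class_name}))


soup = BeautifulSoup(SAMPLE_ROW, 'html.parser')

# The a tags carry no class, so this lookup matches nothing.
print(count_matches(soup, 'a', 'clubs-list'))
# The parent td carries the class, so this lookup finds the cell.
print(count_matches(soup, 'td', 'club-title'))
```

An empty result from find_all is the usual sign that the tag name or class in the lookup does not match the page, which is cheaper to spot here than in a later KeyError or TypeError.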

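0
votes

This gets every link and address on that page. It looks like, if you want more information about each club, you will have to iterate over the links and load each page.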

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

tds = soup.find_all('td', {'class': 'club-title'})

# Pair each gym's relative link with the street address shown in its row.
links = [(td.find('a')['href'], td.find('p').text) for td in tds]

for link in links:
    print(link)
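
To sketch the "load each page" step, the code below keeps parsing separate from fetching. The markup of the individual club pages is not shown in this thread, so the p.address lookup is borrowed from the question's own attempt and is an assumption; extract_address and scrape_addresses are hypothetical helper names.

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.planetfitness.com/'


def extract_address(html):
    """Return the text of the first <p class="address">, or None if absent.

    Assumption: club pages mark the address up this way (lookup taken from
    the question); adjust it after inspecting a real page.
    """
    soup = BeautifulSoup(html, 'html.parser')
    tag = soup.find('p', {'class': 'address'})
    if tag is None:
        return None
    # Collapse the whitespace between nested spans into single spaces.
    return ' '.join(tag.get_text(' ').split())


def scrape_addresses(links):
    """Fetch each relative link and map its absolute URL to the address found."""
    results = {}
    for link in links:
        url = urljoin(BASE_URL, link)
        response = requests.get(url, timeout=10)
        results[url] = extract_address(response.content)
    return results
```

Calling scrape_addresses(links[:2]) would request only the first two club pages, which keeps a trial run small. Club hours could presumably be handled the same way once the relevant markup on a club page is known.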

0
votes

You want to select the a elements inside the td elements that have the class club-title, and extract their href attribute.

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

# The rows in the table of gyms are formatted like so:
# <tr>
# <td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
# <td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
# </tr>

# This will find all the links to all the gyms.
atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]

for link in links:
    print(link)
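
For illustration, the same selector can be combined with urljoin to produce absolute URLs together with the address shown in each row. The inlined SAMPLE_ROWS string mirrors the row format in the comments above so the sketch runs without a request; parse_gyms is a hypothetical name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.planetfitness.com/'

# Assumed markup, copied from the row format described in the comments.
SAMPLE_ROWS = """
<table>
<tr>
<td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
<td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
</tr>
</table>
"""


def parse_gyms(html):
    """Return (absolute_url, address) pairs for every gym link in html."""
    soup = BeautifulSoup(html, 'html.parser')
    gyms = []
    for atag in soup.select('td[class~=club-title] > a[href^="/gyms"]'):
        address = atag.find('p')
        gyms.append((
            urljoin(BASE_URL, atag['href']),
            address.get_text(strip=True) if address else None,
        ))
    return gyms


for url, address in parse_gyms(SAMPLE_ROWS):
    print(url, address)
```

Because the selector requires the link to be a direct child of the club-title cell, the "Join Now" link in the neighbouring club-join cell is skipped.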