Poorly worded question: IGNORE --- Web Scraping

Question (votes: -1, answers: 3)

The code below gets the URL of every gym location on this website, starting with "AL, Albertville."

from urllib.parse import urljoin  # Python 3 location of urljoin
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

# Each gym link sits inside a <td class="club-title"> cell.
tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']

for link in links:
    if any(keyword in link for keyword in keywords):
        print(urljoin('https://www.planetfitness.com/', link))

The first two links it outputs:

https://www.planetfitness.com/gyms/albertville-al
https://www.planetfitness.com/gyms/alexander-city-al

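However, I am trying to scrape the following from each of the output links:

  • Street address
  • Club hours

Below is my attempt at the street address part. I don't know how to fix the code so that it actually works.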

res1 = requests.get(urljoin('https://www.planetfitness.com/', link)).content
soup1 = BeautifulSoup(res1, 'html.parser')

ps = soup.find_all('p', {'class': 'address'})
address1 = [p.find('span')['itemprop'] for p in ps]

This image of when you inspect street address may help


Thanks for your help!

python web-scraping beautifulsoup python-requests href
3 Answers
0
votes

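This will get every link and address on that page. It looks like, if you want more information about each club, you will have to iterate over the links and load each page.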

from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

# Each matching <td> holds the gym link and a <p> with the street address.
atags = soup.find_all('td', {'class': 'club-title'})

links = [(atag.find('a')['href'], atag.find('p').text) for atag in atags]

for link in links:
    print(link)
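
To try the idea without hitting the site, here is a minimal offline sketch. It assumes the sitemap rows look like the sample row quoted in another answer on this page; `SAMPLE_HTML` and `extract_clubs` are names made up for illustration, and the real page may differ.

```python
from bs4 import BeautifulSoup

# Sample row modeled on the sitemap markup quoted on this page (an assumption
# about the live page, which may have changed).
SAMPLE_HTML = """
<table>
<tr>
<td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
<td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
</tr>
</table>
"""


def extract_clubs(html):
    """Return (href, address) pairs, skipping cells without a usable link."""
    soup = BeautifulSoup(html, "html.parser")
    clubs = []
    for td in soup.find_all("td", class_="club-title"):
        a = td.find("a")
        p = td.find("p")
        if a is None or not a.has_attr("href"):
            continue  # no link in this cell, nothing to follow
        address = p.get_text(strip=True) if p is not None else ""
        clubs.append((a["href"], address))
    return clubs


clubs = extract_clubs(SAMPLE_HTML)
for href, address in clubs:
    print(href, "|", address)  # prints: /gyms/albertville-al | 5850 US Hwy 431
```

Checking for a missing `a` or `p` avoids a `TypeError` or `AttributeError` if a row does not follow the expected layout.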

0
votes

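To debug, start by printing out the tags you found. You are searching for all a tags with the class clubs-list, and no such tags exist. The a tag has no class, but its parent td has the class club-title.

You could try something like this.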

import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.planetfitness.com/sitemap").content
soup = BeautifulSoup(res, 'html.parser')

tds = soup.find_all('td', {'class': 'club-title'})
links = [td.find('a')['href'] for td in tds]
keywords = ['gyms']

for link in links:
    if any(keyword in link for keyword in keywords):
        print(link)
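
The debugging step can be sketched offline like this. The HTML snippet is a trimmed, assumed version of one sitemap row, and the variable names are only illustrative. It shows that searching by a class the a tag does not carry returns nothing, while searching the parent td works. It also turns the relative link into an absolute URL with `urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Trimmed, assumed version of one sitemap row.
SAMPLE_HTML = (
    '<table><tr><td class="club-title">'
    '<a href="/gyms/albertville-al"><strong>Albertville, AL</strong></a>'
    "</td></tr></table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The <a> tag has no class, so this search matches nothing.
wrong = soup.find_all("a", class_="clubs-list")
print("a.clubs-list matches:", len(wrong))

# The class lives on the parent <td>, so search for that instead.
tds = soup.find_all("td", class_="club-title")
print("td.club-title matches:", len(tds))

# The hrefs are relative; join them with the site root to get full URLs.
links = [td.find("a")["href"] for td in tds]
absolute = [urljoin("https://www.planetfitness.com/", link) for link in links]
print(absolute)
```

Printing the match counts first makes it obvious which selector is failing before any later indexing raises an error.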

0
votes

You want to select the a elements that are inside td elements with the class club-title, and extract their href attribute.


from bs4 import BeautifulSoup
from bs4 import Tag
import requests
import urllib.parse
import time

sitemap = 'https://www.planetfitness.com/sitemap'
res = requests.get(sitemap).content
soup = BeautifulSoup(res, 'html.parser')

# The rows in the table of gyms are formatted like so:
# <tr>
# <td class="club-title"><a href="/gyms/albertville-al"><strong>Albertville, AL</strong> <p>5850 US Hwy 431</p></a></td>
# <td class="club-join"><div class="button"><a href="/gyms/albertville-al/offers" title="Join Albertville, AL">Join Now</a></div></td>
# </tr>

# This will find all the links to all the gyms.
atags = soup.select('td[class~=club-title] > a[href^="/gyms"]')
links = [atag.get('href') for atag in atags]

for link in links:
    # Follow the link to this gym
    gymurl = urllib.parse.urljoin(sitemap, link)
    print(gymurl)
    res = requests.get(gymurl).content
    soup = BeautifulSoup(res, 'html.parser')

    # Print the address of this gym.
    address_line1 = soup.select('p[class~=address] > span[class~=address-line1]')
    print('    address-line1 = ', address_line1[0].text)
    locality = soup.select('p[class~=address] > span[class~=locality]')
    print('    locality = ', locality[0].text)
    administrative_area = soup.select('p[class~=address] > span[class~=administrative-area]')
    print('    administrative-area = ', administrative_area[0].text)
    postal_code = soup.select('p[class~=address] > span[class~=postal-code]')
    print('    postal-code = ', postal_code[0].text)
    country = soup.select('p[class~=address] > span[class~=country]')
    print('    country = ', country[0].text)

    # Print the hours of this gym.
    strongs = soup.select('div > strong')
    for strong in strongs:
        if strong.text == 'Club Hours':
            for sibling in strong.next_siblings:
                if isinstance(sibling, Tag):
                    hours = sibling.text
                    print('    hours = ', hours.replace('<br>', '').replace('\n', ', '))
                    break

    time.sleep(3)

When I run it, I get:

https://www.planetfitness.com/gyms/albertville-al
    address-line1 =  5850 US Hwy 431
    locality =  Albertville
    administrative-area =  AL
    postal-code =  35950
    country =  United States
    hours =  Monday-Friday 6am-9pm, Saturday-Sunday 7am-7pm
https://www.planetfitness.com/gyms/alexander-city-al
    address-line1 =  987 Market Place
    locality =  Alexander City
    administrative-area =  AL
    postal-code =  35010
    country =  United States
    hours =  Convenient hours when we reopen
https://www.planetfitness.com/gyms/bessemer-al
    address-line1 =  528 W Town Plaza
    locality =  Bessemer
    administrative-area =  AL
    postal-code =  35020
    country =  United States
    hours =  Convenient hours when we reopen

...
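
The loop above indexes `[0]` on every selector result, so a gym page missing one of the fields would raise an `IndexError`. Below is a hedged sketch of a more defensive version. `SAMPLE_GYM_HTML` is a made-up fragment whose structure is only inferred from the selectors used above, and `parse_address` and `parse_hours` are hypothetical helper names.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical gym-page fragment; the structure is assumed from the selectors
# in the answer, and the country span is left out on purpose.
SAMPLE_GYM_HTML = """
<div>
  <p class="address">
    <span class="address-line1">5850 US Hwy 431</span>
    <span class="locality">Albertville</span>
    <span class="administrative-area">AL</span>
    <span class="postal-code">35950</span>
  </p>
  <div><strong>Club Hours</strong><p>Monday-Friday 6am-9pm<br>Saturday-Sunday 7am-7pm</p></div>
</div>
"""

ADDRESS_FIELDS = ["address-line1", "locality", "administrative-area",
                  "postal-code", "country"]


def parse_address(soup):
    """Map each address field to its text, or None when the span is absent."""
    result = {}
    for field in ADDRESS_FIELDS:
        span = soup.select_one(f"p.address > span.{field}")
        result[field] = span.get_text(strip=True) if span is not None else None
    return result


def parse_hours(soup):
    """Return the text of the first tag after a 'Club Hours' heading, if any."""
    for strong in soup.select("div > strong"):
        if strong.get_text(strip=True) != "Club Hours":
            continue
        for sibling in strong.next_siblings:
            if isinstance(sibling, Tag):
                # Join the text pieces around <br> with ", ".
                return sibling.get_text(", ", strip=True)
    return None


soup = BeautifulSoup(SAMPLE_GYM_HTML, "html.parser")
address = parse_address(soup)
hours = parse_hours(soup)
print(address)
print(hours)
```

Returning `None` for a missing field lets the caller decide whether to skip the gym or log it, instead of stopping the whole crawl on the first unusual page.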
