Web Scraping in Python Using the Books to Scrape Website


I have a problem with my code because it doesn't work. Visual Studio shows an AttributeError on the category variable - Traceback (most recent call last): File "", line 11, in AttributeError: 'NoneType' object has no attribute 'find_all'

I can't figure out where the problem is. I'm stuck, so I'd be glad if anyone knows where I made a mistake. Here is the code:

from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

url = 'https://books.toscrape.com/'
response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

books_data = []
for page_num in range(1,51):
    url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    books = soup.find_all('h3')

    for book in books:
        book_url = book.find('a')['href']
        book_response = requests.get(url + book_url)
        book_soup = BeautifulSoup(book_response.content, 'html.parser')
        
        title = book_soup.find('h1').text
        category = book_soup.find('ul', class_ = 'breadcrumb').find_all('a')[2].text.strip()
        rating = book_soup.find('p', class_ = 'star-rating')['class'][1]
        price = book_soup.find('p', class_ = 'price_color').text.strip()
        availibility = book_soup.find('p', class_ = 'availibility').text.strip()

        books_data = ([title, category, rating, price, availibility])
print(books_data)
1 Answer

There is a problem with how book_url is built. You are combining the pagination page URL with the book's href value, so the final book_url is an invalid URL.
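For example, on the first catalogue page the concatenation ends up looking roughly like this (a minimal sketch with an illustrative href value, not code from the question):

url = 'https://books.toscrape.com/catalogue/page-1.html'
book_url = 'a-light-in-the-attic_1000/index.html'  # relative href from an <h3><a> tag (illustrative)
# Joining the two yields a path that does not exist on the server:
print(url + book_url)
# https://books.toscrape.com/catalogue/page-1.htmla-light-in-the-attic_1000/index.html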

As a result, book_response contains only a 404 Not Found page, which has nothing but an h1 tag, so find('ul', class_='breadcrumb') returns None and calling find_all on it raises the AttributeError you see. There are a few other issues as well, such as the data never being appended to books_data.

Check the modified code below:

from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd

books_data = []
for page_num in range(1, 51):
    # Each of the 50 catalogue pages lists 20 books.
    page_url = f'https://books.toscrape.com/catalogue/page-{page_num}.html'
    response = requests.get(page_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Every book title on a listing page sits inside an <h3> tag.
    books = soup.find_all('h3')

    for book in books:
        # The href is relative, so build the detail-page URL from the catalogue root.
        book_href = book.find('a')['href']
        book_url = f"https://books.toscrape.com/catalogue/{book_href}"
        book_response = requests.get(book_url)
        book_soup = BeautifulSoup(book_response.content, 'html.parser')

        title = book_soup.find('h1').text
        # The category is the third link in the breadcrumb (Home > Books > Category).
        category = book_soup.find('ul', class_='breadcrumb').find_all('a')[2].text.strip()
        # The rating word ('One' ... 'Five') is the second CSS class of the star-rating tag.
        rating = book_soup.find('p', class_='star-rating')['class'][1]
        price = book_soup.find('p', class_='price_color').text.strip()
        # The availability paragraph carries the classes 'instock availability'.
        availability = book_soup.find_all('p', class_=['instock', 'availability'])[0].text.strip()

        data = [title, category, rating, price, availability]
        books_data.append(data)

print(books_data)

Output:

[['A Light in the Attic', 'Poetry', 'Three', '£51.77', 'In stock (22 available)'],
['Tipping the Velvet', 'Historical Fiction', 'One', '£53.74', 'In stock (20 available)'],
['Soumission', 'Fiction', 'One', '£50.10', 'In stock (20 available)'],
['Sharp Objects', 'Mystery', 'Four', '£47.82', 'In stock (20 available)'],
.
.
.
['Mesaerion: The Best Science Fiction Stories 1800-1849', 'Science Fiction', 'One', '£37.59', 'In stock (19 available)'],
['Libertarianism for Beginners', 'Politics', 'Two', '£51.33', 'In stock (19 available)'],
["It's Only the Himalayas", 'Travel', 'Two', '£45.17', 'In stock (19 available)']]