Why doesn't my Scrapy pagination work when I use counters?

Problem description

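The code doesn't run. I'm trying to scrape multiple pages from different brands (categories). The page I'm working with lists all the brands, arranged in groups by the brand's first letter. Each brand's page, in turn, spans multiple pages of products.

I tried to write code that uses counters to fetch each brand and, once a given first-letter group has no more brands, moves on to the next group. (The requests are fine; the problem is in the code. The scraping itself works, and the code only fails when I attempt this pagination.)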

import scrapy
from scrapy import Request

class MlSpider(scrapy.Spider):
    name = "ml"

    def start_requests(self):
        yield scrapy.Request('https://lista.mercadolivre.com.br/produtos-cabelo')

    def parse(self, response, **kwargs):
        cgroup = 1
        cbrand = 1
        num_group = response.xpath(f'//div[@class="ui-search-search-modal-filter-group"][{cgroup}]').get()
        for m in num_group:
            link_marca = m.xpath(f'.//a[@class="ui-search-search-modal-filter ui-search-link"][{cbrand}]/@href').get()
            if link_marca:
                yield scrapy.Request(url=link_marca)
                for i in response.xpath('.//div[@class="ui-search-result__content"]'):
                    marca = i.xpath('.//span[@class="ui-search-item__brand-discoverability ui-search-item__group__element"]/text()').get()
                    title = i.xpath('.//h2/text()').get()
                    real = i.xpath('.//span[@class="andes-money-amount ui-search-price__part ui-search-price__part--medium andes-money-amount--cents-superscript"]//span[@class="andes-money-amount__fraction"]/text()').get()
                    centavo = i.xpath('//span[@class="andes-money-amount ui-search-price__part ui-search-price__part--medium andes-money-amount--cents-superscript"]//span[@class="andes-money-amount__cents andes-money-amount__cents--superscript-24"]/text()').get()
                    value = f'R$ {real},{centavo}'
                    link = i.xpath('.//a/@href').get()

                    yield {
                        'marca': marca,
                        'title': title,
                        'value': value,
                        'link': link
                    }

                next_page = response.xpath('//a[contains(@title,"Seguinte")]/@href').get()
                if next_page:
                    yield scrapy.Request(url=next_page, callback=self.parse)

                cbrand += 1

            else:
                cgroup += 1
python csv web-scraping scrapy pycharm
1 Answer

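Pagination doesn't work because of where you placed the next-page logic: it sits inside the brand loop in `parse`, the same callback that handles the brand index, and `scrapy.Request(url=link_marca)` has no `callback`, so Scrapy sends the brand pages back to `parse` as well. I've edited your code so that it starts from the brand index page, visits each brand, scrapes the product details and, if there is a next page, follows it and scrapes the products on that page too. I also changed some of your selectors, as shown in the revised spider below.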
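One more reason the original `parse` fails, sketched here in plain Python with no Scrapy required: `.get()` returns the first match serialized as a string (or `None`), not a list of selectors. The `first_match` value below is a hypothetical stand-in for that return value, not real page content.

```python
# Hypothetical stand-in for what response.xpath(...).get() returns:
# the first matching node, serialized as a plain string.
first_match = '<div class="group"><a href="/brand-a">A</a></div>'

# Looping over a string walks it one character at a time.
pieces = list(first_match)[:4]

# Each character is a plain str, and str has no .xpath() method,
# so m.xpath(...) inside such a loop raises AttributeError.
has_xpath = hasattr(pieces[0], "xpath")

print(pieces)     # ['<', 'd', 'i', 'v']
print(has_xpath)  # False
```

Collecting every link with `.getall()` and looping over the resulting list, as the revised spider in this answer does, sidesteps this.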

import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    allowed_domains = ["lista.mercadolivre.com.br"]
    start_urls = [
        "https://lista.mercadolivre.com.br/produtos-cabelo_FiltersAvailableSidebar?filter=BRAND"
    ]

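    def parse(self, response):
        # The start URL is the brand filter page; collect every brand link at once.
        brand_links = response.xpath("//div[@class='ui-search-search-modal-grid-columns']/a/@href").getall()

        for link in brand_links:
            # Each brand's listing is handled by parse_products, not parse.
            yield scrapy.Request(link, callback=self.parse_products)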

    def parse_products(self, response):
        for i in response.xpath('.//div[@class="ui-search-result__content"]'):
            marca = i.xpath('.//span[contains(@class, "ui-search-item__brand-discoverability")]/text()').get()
            title = i.xpath(".//h2/text()").get()
            real = i.xpath('.//span[@class="andes-money-amount__fraction"]/text()').get()
            centavo = i.xpath('.//span[contains(@class, "andes-money-amount__cents")]/text()').get()
            value = f"R$ {real},{centavo}"
            link = i.xpath(".//a/@href").get()

            yield {
                "marca": marca,
                "title": title,
                "value": value,
                "link": link,
            }

        # Follow the "Seguinte" (next) link, if any, back into this same callback.
        next_page = response.xpath('//a[contains(@title,"Seguinte")]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_products)
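
One edge case the spider above leaves open: `.get()` returns `None` when nothing matches, so a listing without a cents element would produce a value such as `R$ 12,None`. A minimal sketch of a guard follows. `format_price` is a hypothetical helper, and treating missing cents as `00` is an assumption about how the site displays whole-number prices.

```python
from typing import Optional


def format_price(real: Optional[str], centavo: Optional[str]) -> Optional[str]:
    """Build a 'R$ <reais>,<centavos>' string from two scraped text nodes."""
    if real is None:
        # No price fraction was found at all; report the price as missing.
        return None
    # Assumption: a missing cents node means a whole-number price.
    cents = centavo.strip() if centavo else "00"
    return f"R$ {real.strip()},{cents}"


print(format_price("12", "90"))     # R$ 12,90
print(format_price("1.299", None))  # R$ 1.299,00
print(format_price(None, None))     # None
```

In `parse_products`, `value = format_price(real, centavo)` could then replace the f-string.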