Scrapy - Every page gets scraped, but Scrapy wraps around and scrapes the first x pages again

Question

class HomedepotcrawlSpider(CrawlSpider):

    name = 'homeDepotCrawl'
    #allowed_domains = ['homedepot.com']
    start_urls =['https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=0']

    def parse(self, response):

        for item in self.parseHomeDepot(response):
            yield item

        next_page_url = response.xpath('//link[@rel="next"]/@href').extract_first()
        if next_page_url:
            yield response.follow(url=next_page_url, callback=self.parse)



    def parseHomeDepot(self, response):

        items = response.css('.plp-pod')
        for product in items:
            item = HomedepotSpiderItem()

            # get SKU
            productSKU = product.css('.pod-plp__model::text').getall()

            # get rid of all the stuff I don't need
            productSKU = [x.strip(' ') for x in productSKU]  # whitespace
            productSKU = [x.strip('\n') for x in productSKU]
            productSKU = [x.strip('\t') for x in productSKU]
            productSKU = [x.strip(' Model# ') for x in productSKU]  # gets rid of the "Model#" label
            productSKU = [x.strip('\xa0') for x in productSKU]  # gets rid of non-breaking spaces



            item['productSKU'] = productSKU

            yield item

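Explanation of the problem

This is part of a program I have been using to scrape data. I left out the code that scrapes the other fields because I did not think it was necessary for this post. When I run the program and export the data to Excel, I get the first 240 items (10 pages). That takes the spreadsheet up to row 241 (the first row is occupied by the labels). Then, starting at row 242, the first 241 rows are repeated, and then again at rows 482 and 722.

Scraper outputs the first 240 items 3 times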

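Edit

I looked at the log during the crawl, and it turns out that every page does get scraped. The last page is:

https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=696&Ns=None

After that, the log shows the first page being scraped again, which is:

https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default

I imagine this is because the pagination wraps around to the start.

The terminal command I use to export to Excel is:

scrapy crawl homeDepotCrawl -t csv -o - > "(File Location)"

Edit: I use this command because Scrapy appends the scraped data to the file when exporting, so this wipes the target file and creates it again.

The "Next" link I follow to get all the pages looks like this:

<a class="hd-pagination__link" title="Next" href="/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=24&Ns=None" data-pagenumber="2"></a>

Originally I thought the website was causing this unexpected behavior, so in settings.py I set ROBOTSTXT_OBEY = 0 and added a delay, but nothing changed.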

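So what I would like help with:

- Figuring out why the CSV output only contains the first 240 items (10 pages), repeated 3 times

- How to make sure the spider does not go back to the first page after scraping the first 30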

Answer 1

You are indeed starting over: Chrome dev tools shows that once you reach the end, "Next" points back to the first set of items.

You can detect and avoid this with logic that looks at the current item index:

>>> from urllib.parse import urlparse, parse_qs
>>> url = 'https://www.homedepot.com/b/Appliances/ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy?experienceName=default&Nao=696&Ns=None'
>>> parsed = urlparse(url)
>>> page_index = int(parse_qs(parsed.query)['Nao'][0])
>>> page_index
696

Then edit your if next_page_url logic to include something like and page_index > last_page_index.
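One way the check described above could be wired together is sketched below. The helper names product_offset and should_follow are hypothetical (they are not from the question or from Scrapy), and the sketch assumes that a URL without a Nao parameter means offset 0, which matches the first-page URL seen in the log.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def product_offset(url):
    """Return the Nao offset of a listing URL as an int.

    Assumption: a URL with no Nao parameter is the first page (offset 0).
    """
    query = parse_qs(urlparse(url).query)
    return int(query.get("Nao", ["0"])[0])


def should_follow(current_url, next_href):
    """Hypothetical helper: follow "next" only when it moves forward.

    next_href may be relative, so it is resolved against current_url first.
    """
    if not next_href:
        return False
    next_url = urljoin(current_url, next_href)
    return product_offset(next_url) > product_offset(current_url)
```

In the spider's parse method this would replace the plain if next_page_url test, for example if should_follow(response.url, next_page_url): before yielding response.follow(...). On the last page the "next" link resolves to offset 0, which is not greater than 696, so the crawl stops there.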

Answer 2

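I would suggest doing something like this. The main difference is that I take the information from the JSON stored on the page, and I paginate myself by recognizing that Nao is the product offset. The code ends up much shorter as well.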
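As a rough illustration of paginating by offset rather than following the "Next" link, the sketch below builds the listing URLs directly. It is not the answerer's code. The page size of 24 is inferred from the page-2 link in the question (Nao=24), while page_urls and its total_items argument are hypothetical; in a real spider the total would have to be read from the page itself.

```python
from urllib.parse import urlencode

BASE_URL = (
    "https://www.homedepot.com/b/Appliances/"
    "ZLINE-Kitchen-and-Bath/N-5yc1vZbv1wZhsy"
)
PAGE_SIZE = 24  # assumption: inferred from the page-2 link using Nao=24


def page_urls(total_items, page_size=PAGE_SIZE):
    """Yield one listing URL per page, stepping the Nao offset by page_size.

    total_items is a hypothetical input; it is not hard-coded anywhere
    in the original question.
    """
    for offset in range(0, total_items, page_size):
        query = urlencode({"experienceName": "default", "Nao": offset})
        yield f"{BASE_URL}?{query}"
```

Because the offsets are generated from a known upper bound, the crawl cannot wrap around: the last URL produced is the last page, and nothing after it is requested.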
