How to scrape a website that uses Ajax infinite scroll with Scrapy


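I am new to Scrapy and I want to scrape a car dealership's website. The site uses Ajax infinite scroll. I found some infinite-scroll examples online, but they all scrape data from a JSON file, and in this case the site does not use JSON (I may be wrong).

I can only scrape the titles from ?page=1; I cannot get the titles all the way through ?page=8, and the number of pages can change depending on how many vehicles are in inventory. The site has this pagination markup: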

<ul class="pagination pagination-sm" data-url="https://www.marlboroughford.com/inventory?page=d">
<li id="il-pagination-element-1" class="">
<a href="https://www.marlboroughford.com/inventory">1</a>
</li>
<li id="il-pagination-element-2" class="">
<a href="https://www.marlboroughford.com/inventory?page=2">2</a>
</li>
<li id="il-pagination-element-3" class="">
<a href="https://www.marlboroughford.com/inventory?page=3">3</a>
</li>
<li><span>...</span></li>
<li id="il-pagination-element-8">
<a href="https://www.marlboroughford.com/inventory?page=8">8</a>
</li>
</ul>

import scrapy

class DealerSpider(scrapy.Spider):
    name = "cars"
    start_urls = [
        'https://www.marlboroughford.com/inventory?page=',
    ]

    def parse(self, response):
        yield {
            'title': response.xpath('/html/body/div[1]/main/div/div/div/div/div/div/div/div/div/div/meta[1]/@content').extract()
        }

Tags: python web-scraping scrapy infinite-scroll

1 answer

This worked for me.

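import scrapy


class AmazonSpider(scrapy.Spider):
    name = 'limit'
    # Flag for my random user-agent setup (not a built-in Scrapy attribute)
    rotate_user_agent = True
    # Not used by the spider; the requests are built in start_requests()
    web_url = ['https://www.marlboroughford.com/inventory?page=']

    # range(COUNT_MAX) requests page=0 through page=8.
    # In the output below, page=0 returns the same titles as page=1.
    COUNT_MAX = 9

    def start_requests(self):
        # Request every inventory page directly instead of scrolling
        for i in range(self.COUNT_MAX):
            yield scrapy.Request('https://www.marlboroughford.com/inventory?page=%d' % i, callback=self.parse1)

    def parse1(self, response):
        print(response.request.headers['User-Agent']) # I have set random user agent
        print(response.status)
        print(response.url)
        my_headers = {
            'Referer': response.url
        }
        # One item per page, holding the list of vehicle titles on that page
        yield {
            'title': response.xpath(
                '/html/body/div[1]/main/div/div/div/div/div/div/div/div/div/div/meta[1]/@content').extract()
        }
        # No callback is set here. In the output below this request is
        # forbidden by robots.txt and later filtered as a duplicate.
        yield scrapy.Request(
            "https://api.autofi.com/v1/vehicle-service/vehicles/search",
            headers=my_headers,
        )
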
# Output

2020-06-14 14:02:00 [scrapy.core.engine] INFO: Spider opened
2020-06-14 14:02:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-06-14 14:02:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-06-14 14:02:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/robots.txt> (referer: None)
2020-06-14 14:02:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=0> (referer: None)
b'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3'
200
https://www.marlboroughford.com/inventory?page=0
2020-06-14 14:02:09 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=0>
{'title': ['2013 Ford Fusion SE  - Bluetooth -  SYNC -  SiriusXM', '2010 Jeep Grand Cherokee Limited  - Sunroof -  Navigation', '2013 Ford Escape SE', '2014 Ford Escape SE  - Bluetooth -  Heated Seats', '2012 Ford Explorer Base  -  Power Windows', '2014 Ford Escape SE  - Bluetooth -  Heated Seats', '2013 Ford Escape SEL  - Leather Seats -  Bluetooth', '2015 Ford Escape SE  - Bluetooth -  Heated Seats', '2015 Dodge Journey R/T', '2017 Ford Escape S  - Certified - Bluetooth', '2016 Ford C-Max SEL  - Navigation -  Leather Seats', '2016 Ford Escape SE  - Certified', '2014 Ford F-150 XLT', '2016 Ford Escape Titanium  - Certified -  SiriusXM', '2017 Ford Escape SE', '2017 Ford Escape SE  - One owner - Certified', '2019 Ford EcoSport SE 4WD', '2017 Ford Escape SE  - Certified - Low Mileage', '2016 Ford F-150 XLT  - One owner - Certified', '2012 Chevrolet Silverado 1500 LTZ  - Leather Seats']}
2020-06-14 14:02:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://api.autofi.com/robots.txt> (referer: None)
2020-06-14 14:02:10 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET https://api.autofi.com/v1/vehicle-service/vehicles/search>
2020-06-14 14:02:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=1> (referer: None)
b'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3'
200
https://www.marlboroughford.com/inventory?page=1
2020-06-14 14:02:13 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=1>
{'title': ['2013 Ford Fusion SE  - Bluetooth -  SYNC -  SiriusXM', '2010 Jeep Grand Cherokee Limited  - Sunroof -  Navigation', '2013 Ford Escape SE', '2014 Ford Escape SE  - Bluetooth -  Heated Seats', '2012 Ford Explorer Base  -  Power Windows', '2014 Ford Escape SE  - Bluetooth -  Heated Seats', '2013 Ford Escape SEL  - Leather Seats -  Bluetooth', '2015 Ford Escape SE  - Bluetooth -  Heated Seats', '2015 Dodge Journey R/T', '2017 Ford Escape S  - Certified - Bluetooth', '2016 Ford C-Max SEL  - Navigation -  Leather Seats', '2016 Ford Escape SE  - Certified', '2014 Ford F-150 XLT', '2016 Ford Escape Titanium  - Certified -  SiriusXM', '2017 Ford Escape SE', '2017 Ford Escape SE  - One owner - Certified', '2019 Ford EcoSport SE 4WD', '2017 Ford Escape SE  - Certified - Low Mileage', '2016 Ford F-150 XLT  - One owner - Certified', '2012 Chevrolet Silverado 1500 LTZ  - Leather Seats']}
2020-06-14 14:02:13 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://api.autofi.com/v1/vehicle-service/vehicles/search> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2020-06-14 14:02:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=2> (referer: None)
b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:23.0) Gecko/20100101 Firefox/23.0'
200
https://www.marlboroughford.com/inventory?page=2
2020-06-14 14:02:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=2>
{'title': ['2018 Ford Escape SEL  - Certified', '2018 Ford Escape SEL  - Certified - Leather Seats', '2017 Chevrolet Traverse 1LT  -  Heated Seat -  SiriusXM', '2019 Ford EcoSport Titanium 4WD  - Top Luxury', '2015 Ford Edge Sport  - One owner - Local - Certified', '2014 Ford Expedition Max Limited', '2018 Ford Escape Titanium  - Certified', '2019 Ford Transit Connect XL  - Heated Mirrors', '2019 Ford Transit Connect XL  - Heated Mirrors', '2019 Ford Transit Connect XL  - Navigation -  SYNC 3', '2019 Ford Transit Connect XL  - Navigation -  SYNC', '2016 Ford Explorer Limited', '2020 Ford Fusion SE FWD  -  Bluetooth', '2019 Ford Fusion SE', '2020 Ford EcoSport SE 4WD', '2019 Ford Fusion SE  - Bluetooth -  SiriusXM', '2020 Ford EcoSport SE 4WD  - Sunroof -  Navigation', '2020 Ford EcoSport SES 4WD', '2020 Ford EcoSport SES 4WD  - Sunroof -  Navigation', '2020 Ford EcoSport SES 4WD  - Sunroof -  Navigation']}
2020-06-14 14:02:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=3> (referer: None)
b'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1'
200
https://www.marlboroughford.com/inventory?page=3
2020-06-14 14:02:21 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=3>
{'title': ['2020 Ford EcoSport SES 4WD  - Sunroof -  Navigation', '2020 Ford Escape S 4WD', '2020 Ford Escape SE  - Heated Seats -   Android Auto', '2014 Ram 1500 Laramie', '2016 Ford F-150 XLT  - SiriusXM - Low Mileage', '2020 Ford Escape SE 4WD', '2020 Ford Escape SE 4WD  - Heated Seats -  Android Auto', '2020 Ford Escape SE 4WD  - Heated Seats -  Android Auto', '2019 Ford Transit Connect XLT  -  Heated Mirrors', '2017 Ford F-150 XLT  - Bluetooth -   A/C - Low Mileage', '2019 Ford Mustang EcoBoost  - Navigation', '2017 Ford F-150 XLT  - Bluetooth -   A/C', '2020 Ford Transit Connect XLT  - Navigation', '2020 Ford Transit Connect XLT  - Navigation', '2020 Ford Transit Connect XLT  - Navigation', '2020 Ford Fusion Hybrid SEL FWD  - Sunroof', '2020 Ford Transit Connect XLT', '2017 Ford F-150 XLT  - Certified - Bluetooth -   A/C', '2020 Ford Mustang EcoBoost Fastback', '2020 Ford Mustang EcoBoost Fastback  - Aluminum Wheels']}
2020-06-14 14:02:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=4> (referer: None)
b'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6'
200
https://www.marlboroughford.com/inventory?page=4
2020-06-14 14:02:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=4>
{'title': ['2020 Ford Escape SEL 4WD  - ActiveX Seats -  Power Liftgate', '2020 Ford Escape SEL 4WD', '2017 Ford F-150 XLT  - Bluetooth -   A/C', '2020 Ford Escape SEL 4WD', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford Escape SEL 4WD', '2017 Ford F-150 XLT  - Certified - Low Mileage', '2019 Ford Ranger XLT', '2019 Ford Ranger XLT  -  Towing Package -  Keyless Entry', '2020 Ford Escape Titanium Hybrid 4WD', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford F-150 XLT  -  Android Auto', '2018 Ford F-150 XLT  - Bluetooth -  SiriusXM', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2019 Ford Ranger XLT  -  Towing Package', '2020 Ford Escape Titanium Hybrid 4WD  - Navigation', '2020 Ford Escape Titanium Hybrid 4WD  - Navigation', '2020 Ford Escape Titanium Hybrid 4WD']}
2020-06-14 14:02:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=5> (referer: None)
b'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5'
200
https://www.marlboroughford.com/inventory?page=5
2020-06-14 14:02:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=5>
{'title': ['2020 Ford Escape Titanium Hybrid 4WD', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford Edge SEL AWD', '2020 Ford Escape Titanium Hybrid 4WD', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford F-150 XLT  -  Android Auto', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT  -  Android Auto', '2017 Ford F-150 Lariat  -  Bluetooth', '2017 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford Ranger ', '2020 Ford Ranger ', '2020 Ford Explorer XLT  -  Wi-Fi -  Power Liftgate', '2020 Ford F-150 XLT', '2020 Ford Ranger ', '2020 Ford Ranger ', '2020 Ford Ranger ', '2020 Ford Ranger ']}
2020-06-14 14:02:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=6> (referer: None)
b'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3'
200
https://www.marlboroughford.com/inventory?page=6
2020-06-14 14:02:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=6>
{'title': ['2020 Ford F-150 XLT  -  Android Auto', '2020 Ford Ranger XLT', '2020 Ford Edge Titanium', '2019 Ford Flex SEL AWD  - Sunroof', '2019 Ford F-150 Lariat', '2020 Ford Transit-150 ', '2020 Ford F-150 XLT', '2020 Ford Edge Titanium', '2020 Ford Explorer XLT', '2020 Ford Explorer XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2019 Ford F-150 XLT  - Navigation -  Sunroof', '2020 Ford Explorer XLT', '2020 Ford Explorer XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT']}
2020-06-14 14:02:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=7> (referer: None)
b'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3'
200
https://www.marlboroughford.com/inventory?page=7
2020-06-14 14:02:37 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=7>
{'title': ['2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 XLT', '2020 Ford F-150 Lariat', '2019 Ford F-150 Lariat   - Sunroof', '2019 Ford F-150 Lariat   - Sunroof -  Running Boards', '2020 Ford F-150 Lariat', '2020 Ford F-150 Lariat', '2019 Ford F-150 Platinum', '2019 Ford F-150 Platinum   - Sunroof', '2020 Ford Explorer ST', '2020 Ford F-250 Super Duty XLT  - SYNC -  Trailer Hitch', '2020 Ford Explorer Platinum  - Sunroof', '2020 Ford F-150 Platinum  - Premium Seats -  Navigation', '2020 Ford F-150 Platinum  - Premium Seats -  Navigation', '2020 Ford F-150 Platinum', '2020 Ford F-350 Super Duty XLT', '2020 Ford F-150 Lariat', '2020 Ford Expedition Limited  - Navigation -  Sunroof', '2020 Ford F-150 Limited']}
2020-06-14 14:02:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marlboroughford.com/inventory?page=8> (referer: None)
b'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3'
200
https://www.marlboroughford.com/inventory?page=8
2020-06-14 14:02:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.marlboroughford.com/inventory?page=8>
{'title': ['2020 Ford F-350 Super Duty Lariat', '2020 Ford F-350 Super Duty Lariat', '2020 Ford F-350 Super Duty Lariat', '2020 Ford Expedition Limited', '2020 Ford Expedition Max Platinum  - Navigation', '2020 Ford F-250 Super Duty King Ranch', '2020 Ford F-450 Super Duty ']}
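
Since the question says the number of pages changes with the inventory, the hard-coded COUNT_MAX could instead be read from the pagination markup shown in the question. The following is only a minimal sketch using the standard library: the helper names `last_page` and `page_urls` are hypothetical, and it assumes the first page's HTML always contains pagination links of the form `?page=N`, which may not hold if the site changes its markup.

```python
import re
from urllib.parse import urlparse, parse_qs

# Pagination markup copied from the question (abbreviated to the links).
PAGINATION_HTML = """
<ul class="pagination pagination-sm" data-url="https://www.marlboroughford.com/inventory?page=d">
<li id="il-pagination-element-1" class=""><a href="https://www.marlboroughford.com/inventory">1</a></li>
<li id="il-pagination-element-2" class=""><a href="https://www.marlboroughford.com/inventory?page=2">2</a></li>
<li id="il-pagination-element-3" class=""><a href="https://www.marlboroughford.com/inventory?page=3">3</a></li>
<li><span>...</span></li>
<li id="il-pagination-element-8"><a href="https://www.marlboroughford.com/inventory?page=8">8</a></li>
</ul>
"""


def last_page(html):
    """Return the highest numeric ?page=N found in href attributes, or 1 if none."""
    pages = [1]
    for href in re.findall(r'href="([^"]+)"', html):
        query = parse_qs(urlparse(href).query)
        for value in query.get("page", []):
            if value.isdigit():
                pages.append(int(value))
    return max(pages)


def page_urls(html, base="https://www.marlboroughford.com/inventory"):
    """Build one URL per page, starting at page=1 (page=0 duplicated page=1 in the log above)."""
    return [f"{base}?page={n}" for n in range(1, last_page(html) + 1)]


if __name__ == "__main__":
    for url in page_urls(PAGINATION_HTML):
        print(url)
```

In a spider, one way to use this would be to request the plain inventory URL first and, in its callback, yield `scrapy.Request(url, callback=self.parse1)` for each URL returned by `page_urls(response.text)`.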