Parsing Amazon with Scrapy

Question    Votes: 0    Answers: 1

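I'm trying to scrape the Amazon India site with Scrapy, but I think my IP address has been blocked... How long will my IP address stay blocked? As a beginner, I have now added a download delay and limited the number of concurrent requests. Could I use an API instead? If so, how should I go about it? By the way, I have a TunnelBear VPN; do you think I could use it without getting blocked? Here is my code: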

# -*- coding: utf-8 -*-
"""
Created on Wed Nov 22 17:48:41 2023
"""
# ================
# Webscraping
# ================

import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request

class AmazonSpider(Spider):
    name = "amazon_in"
    allowed_domains = ["amazon.in"]

    custom_settings = {
        # FEEDS replaces the deprecated FEED_FORMAT / FEED_URI settings (Scrapy 2.1+)
        'FEEDS': {'amazon-in.csv': {'format': 'csv'}},
        'DOWNLOAD_DELAY': 5,  # Add a delay to avoid overloading the server
        'CONCURRENT_REQUESTS': 1,
    }

    def start_requests(self):
        # Provide the correct path to ASIN file
        file_path = 'unique_ASIN.txt'

        # Read ASINs from the file
        with open(file_path, 'r') as file:
            asin_list = file.read().splitlines()

        # Generate start URLs for each ASIN
        for asin in asin_list:
            url = f"https://www.amazon.in/dp/{asin}"
            yield Request(url, callback=self.parse_item)

    def parse_item(self, response):
        # Extract information from the product page using XPath selectors
        # Rating
        rating = response.xpath(".//span[@data-hook='rating-out-of-text' and @class='a-size-medium a-color-base']/text()").get()
        if not rating:
            rating = "No information"
        material_compo = response.xpath(
            './/div[@class="a-fixed-left-grid-col a-col-right"]/span[@style="font-weight: 400;"]/span[@class="a-color-base"]/text()').extract_first()
        brand = response.xpath('.//div[@class="a-section a-spacing-none"]/a/text()').get()
        gender = response.xpath('.//span/a[@class="a-link-normal a-color-tertiary" and (contains(text(), "Men") or contains(text(), "Women") or contains(text(), "Kid") or contains(text(), "Girls") or contains(text(), "Boys"))]/text()').get()

        yield {
            'ASIN': response.url,  # 'url' is not defined in this method; use the URL of the fetched page
            'Gender': gender,
            'Composition': material_compo,
            'Rating': rating,
            'Brand': brand,
        }

Thank you in advance.

Edit: when I changed the VPN it worked and I was able to get 49 items, but even though I keep changing the VPN, I still get this message:

2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled item pipelines: []
2023-11-23 09:34:03 [scrapy.core.engine] INFO: Spider opened
2023-11-23 09:34:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-23 09:34:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-23 09:34:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/robots.txt> (referer: None)
2023-11-23 09:34:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.in/dp/B081WVMMCY?th=1&psc=1> (failed 1 times): 503 Service Unavailable

EDIT2: It worked once I changed the User-Agent header ;)

python parsing scrapy vpn
1 Answer
0
votes

I had to send a different User-Agent header ;)
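For anyone landing here, this is roughly what changing the User-Agent can look like. It is only a sketch: the `USER_AGENTS` list and the `build_headers` helper are hypothetical names made up for illustration, and the strings are example browser values, not the ones actually used above. The helper is plain Python, so it runs without Scrapy installed; the comments show where it would plug into the spider from the question.

```python
import random

# Example browser User-Agent strings (assumed values, not taken from the post).
# Replace them with the User-Agent of a browser you actually use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def build_headers(user_agents=USER_AGENTS):
    """Return request headers with a User-Agent picked at random from the list."""
    return {"User-Agent": random.choice(user_agents)}


# Per-request use inside start_requests (Request accepts a headers argument):
#     yield Request(url, callback=self.parse_item, headers=build_headers())
#
# Or one fixed value for the whole spider via Scrapy's USER_AGENT setting:
#     custom_settings = {'USER_AGENT': USER_AGENTS[0], 'DOWNLOAD_DELAY': 5}

if __name__ == "__main__":
    print(build_headers())
```

Without this, Scrapy sends its own default User-Agent, which identifies the client as Scrapy. Whether a browser-like value keeps working depends on the site, so treat it as something to try rather than a guarantee.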
