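I'm trying to scrape Amazon India with Scrapy, but I think my IP address has been blocked. How long is the block likely to last? I'm a beginner, and I have now added a download delay and limited the number of concurrent requests. Could I use an API instead, and if so, how should I go about it? Also, I have a TunnelBear VPN; do you think I can use it without getting blocked? Here is my code: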
# -*- coding: utf-8 -*-
"""
Created on Wed Nov 22 17:48:41 2023
"""
# ================
# Webscraping
# ================
import scrapy
from scrapy.spiders import Spider
from scrapy.http import Request


class AmazonSpider(Spider):
    name = "amazon_in"
    allowed_domains = ["amazon.in"]
    custom_settings = {
        # FEEDS replaces the deprecated FEED_FORMAT / FEED_URI pair (Scrapy 2.1+)
        'FEEDS': {'amazon-in.csv': {'format': 'csv'}},
        'DOWNLOAD_DELAY': 5,  # Add a delay to avoid overloading the server
        'CONCURRENT_REQUESTS': 1,
    }

    def start_requests(self):
        # Provide the correct path to the ASIN file
        file_path = 'unique_ASIN.txt'
        # Read ASINs from the file, skipping blank lines
        with open(file_path, 'r') as file:
            asin_list = [line.strip() for line in file if line.strip()]
        # Generate one product-page request per ASIN
        for asin in asin_list:
            url = f"https://www.amazon.in/dp/{asin}"
            # Pass the ASIN to the callback; `url` is not in scope inside parse_item
            yield Request(url, callback=self.parse_item, cb_kwargs={'asin': asin})

    def parse_item(self, response, asin):
        # Extract information from the product page using XPath selectors
        # Rating
        rating = response.xpath(".//span[@data-hook='rating-out-of-text' and @class='a-size-medium a-color-base']/text()").get()
        if not rating:
            rating = "No information"
        material_compo = response.xpath(
            './/div[@class="a-fixed-left-grid-col a-col-right"]/span[@style="font-weight: 400;"]/span[@class="a-color-base"]/text()').get()
        brand = response.xpath('.//div[@class="a-section a-spacing-none"]/a/text()').get()
        gender = response.xpath('.//span/a[@class="a-link-normal a-color-tertiary" and (contains(text(), "Men") or contains(text(), "Women") or contains(text(), "Kid") or contains(text(), "Girls") or contains(text(), "Boys"))]/text()').get()
        yield {
            'ASIN': asin,
            'Gender': gender,
            'Composition': material_compo,
            'Rating': rating,
            'Brand': brand,
        }
Thanks in advance.
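A side note on the `ASIN` field: the callback only receives the response, so one way to fill that field is to recover the ASIN from `response.url`. The sketch below is a minimal, standard-library-only helper; the name `asin_from_url` is hypothetical, and it assumes an ASIN is 10 uppercase alphanumeric characters directly after `/dp/`, which matches the URLs in this post but may not cover every Amazon URL shape.

```python
import re
from urllib.parse import urlparse

# Assumption: an ASIN is 10 uppercase letters/digits right after "/dp/".
_ASIN_RE = re.compile(r"/dp/([A-Z0-9]{10})(?:/|$)")


def asin_from_url(url):
    """Return the ASIN found in a product URL, or None if there is none."""
    # Only look at the path so query strings such as "?th=1&psc=1" are ignored.
    path = urlparse(url).path
    match = _ASIN_RE.search(path)
    return match.group(1) if match else None
```

Inside the callback this could be used as `asin_from_url(response.url)`, which keeps working even when the site redirects to a URL with extra query parameters.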
EDIT: After I switched VPN servers it worked and I was able to get 49 items, but even though I keep changing the VPN, I still get this message:
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.feedexport.FeedExporter', 'scrapy.extensions.logstats.LogStats']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2023-11-23 09:34:03 [scrapy.middleware] INFO: Enabled item pipelines: []
2023-11-23 09:34:03 [scrapy.core.engine] INFO: Spider opened
2023-11-23 09:34:03 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-11-23 09:34:03 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-11-23 09:34:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.in/robots.txt> (referer: None)
2023-11-23 09:34:09 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.amazon.in/dp/B081WVMMCY?th=1&psc=1> (failed 1 times): 503 Service Unavailable
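The last log line shows Scrapy's retry middleware reacting to a 503. For reference, here is a sketch of the built-in Scrapy settings that control throttling and retries. The setting names are real Scrapy settings; the dictionary name `POLITE_SETTINGS` and all the values are only illustrative assumptions, and none of them guarantees that the site will stop answering with 503.

```python
# Hypothetical values for illustration; tune them for your own crawl.
POLITE_SETTINGS = {
    "DOWNLOAD_DELAY": 5,                     # base delay between requests, in seconds
    "RANDOMIZE_DOWNLOAD_DELAY": True,        # wait 0.5x to 1.5x DOWNLOAD_DELAY
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,     # one request at a time per domain
    "AUTOTHROTTLE_ENABLED": True,            # adapt the delay to server latency
    "AUTOTHROTTLE_START_DELAY": 5,           # initial AutoThrottle delay
    "AUTOTHROTTLE_MAX_DELAY": 60,            # upper bound for the adapted delay
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,  # aim for one request in flight
    "RETRY_TIMES": 2,                        # retries after the first failure
    "RETRY_HTTP_CODES": [503, 429],          # status codes that trigger a retry
}
```

These keys could be merged into the spider's `custom_settings`, for example `custom_settings = {**POLITE_SETTINGS, 'CONCURRENT_REQUESTS': 1}`.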
EDIT2: It worked once I changed the User-Agent header ;)
I had to send different headers for my user agent ;)
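For anyone wondering what changing the User-Agent can look like in practice, below is a minimal sketch of a downloader middleware that picks a User-Agent per request. The class name `RandomUserAgentMiddleware`, the `USER_AGENTS` list, and the example strings are all assumptions for illustration, not what was actually used here; it relies only on the `process_request(request, spider)` hook that Scrapy calls on downloader middlewares.

```python
import random

# Hypothetical example strings; replace them with current, real browser values.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware that sets a randomly chosen User-Agent header."""

    def __init__(self, user_agents=USER_AGENTS):
        self.user_agents = list(user_agents)

    def process_request(self, request, spider):
        # Overwrite (or set) the header before the request is downloaded.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None tells Scrapy to keep processing the request normally.
        return None
```

To enable it, the class would be registered in `DOWNLOADER_MIDDLEWARES`, e.g. `{'myproject.middlewares.RandomUserAgentMiddleware': 400}`, where the module path is a placeholder for wherever the class lives. If a single fixed value is enough, setting `USER_AGENT` in `custom_settings` is the simpler option.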