middlewares.py
class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Route every request through a hard-coded proxy.
        request.meta['proxy'] = "http://103.35.64.12:3128"
        return None
settings.py
BOT_NAME = 'SGinfotrackker'
SPIDER_MODULES = ['SGinfotrackker.spiders']
NEWSPIDER_MODULE = 'SGinfotrackker.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 15
DOWNLOADER_MIDDLEWARES = {
    'SGinfotrackker.middlewares.CustomProxyMiddleware': 350,
    'SGinfotrackker.middlewares.SginfotrackkerDownloaderMiddleware': None,
}
First I got

Ignoring response <520 ...>: HTTP status code is not handled or not allowed

and then I got

403 HTTP status code is not handled or not allowed
Are you running the spider with scrapy runspider? Ideally you should use scrapy crawl so that your settings are applied. Also, as an alternative to the custom middleware, you can set the http_proxy and https_proxy environment variables to use a proxy for all requests. You can set them externally before launching the spider, or inside the script at the start, for example:
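The example the answer refers to appears to have been lost; a minimal sketch of setting the proxy environment variables at the top of a script, reusing the proxy address from the middleware above (Scrapy's built-in HttpProxyMiddleware picks these variables up automatically):

```python
import os

# Set the proxy environment variables before Scrapy starts making requests.
# The proxy address is the one from the question; substitute your own.
os.environ['http_proxy'] = 'http://103.35.64.12:3128'
os.environ['https_proxy'] = 'http://103.35.64.12:3128'
```

These must be set before the crawl starts, i.e. before the first request is scheduled.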