`scrapy` gets no response from a website, but `requests` does


I'm using `scrapy` to crawl this page, but for some reason `scrapy` gets no response from the site. When I run the spider, I get HTTP 500 errors.

Here is my basic spider:

import scrapy

class SavingsGov(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/'
    ]

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }

Here is the error I get when I run it (I had also increased the retry count to 10 in `settings.py`):

2023-08-26 16:30:22 [scrapy.core.engine] INFO: Spider opened
2023-08-26 16:30:22 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2023-08-26 16:30:22 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2023-08-26 16:30:24 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:25 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:28 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:30 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:31 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:33 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:35 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:37 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:39 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/robots.txt> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/robots.txt> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:40 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/robots.txt> (referer: None)
2023-08-26 16:30:40 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2023-08-26 16:30:41 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 1 times): 500 Internal Server Error
2023-08-26 16:30:43 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 2 times): 500 Internal Server Error
2023-08-26 16:30:44 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 3 times): 500 Internal Server Error
2023-08-26 16:30:46 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 4 times): 500 Internal Server Error
2023-08-26 16:30:47 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 5 times): 500 Internal Server Error
2023-08-26 16:30:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 6 times): 500 Internal Server Error
2023-08-26 16:30:50 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 7 times): 500 Internal Server Error
2023-08-26 16:30:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 8 times): 500 Internal Server Error
2023-08-26 16:30:53 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 9 times): 500 Internal Server Error
2023-08-26 16:30:55 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://savings.gov.pk/download-draws/> (failed 10 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET https://savings.gov.pk/download-draws/> (failed 11 times): 500 Internal Server Error
2023-08-26 16:30:56 [scrapy.core.engine] DEBUG: Crawled (500) <GET https://savings.gov.pk/download-draws/> (referer: None)
2023-08-26 16:30:56 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <500 https://savings.gov.pk/download-draws/>: HTTP status code is not handled or not allowed
2023-08-26 16:30:56 [scrapy.core.engine] INFO: Closing spider (finished)
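(For reference: the retry count mentioned above is controlled by Scrapy's `RETRY_TIMES` setting, and 500 responses are retried because 500 is in `RETRY_HTTP_CODES` by default. A minimal `settings.py` fragment matching that description:)

```python
# settings.py -- sketch of the retry tuning described above
RETRY_TIMES = 10  # number of extra attempts per request (Scrapy's default is 2)
# Scrapy's default list of status codes that trigger a retry:
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
```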

However, I can easily get a response using Python's `requests` module. Here is the code:

import requests

response = requests.get('https://savings.gov.pk/download-draws/')
print(response.text)
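One likely difference between the two clients is the `User-Agent` header each one sends: `requests` identifies itself as `python-requests/...`, while Scrapy's default UA is bot-like, and some servers reject it. A quick way to inspect the header `requests` would send, without touching the network (a debugging sketch, not from the original post):

```python
import requests

# Prepare the request through a Session so requests' default headers
# (including User-Agent) are merged in, but never actually send it.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request('GET', 'https://savings.gov.pk/download-draws/')
)
print(prepared.headers.get('User-Agent'))  # e.g. 'python-requests/2.31.0'
```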

I don't know why this is happening; I assume the problem lies with `scrapy.Request`.

Is there a way to perform the request with `requests` and pass the response to `scrapy`? An even better option would be some way to debug `scrapy.Request`.

I'm new to `scrapy`, so please let me know if I've misunderstood the question.

web-scraping python-requests scrapy web-crawler
1 Answer

0 votes

This is most likely because the server rejects requests carrying Scrapy's default user agent.

Try setting a custom `USER_AGENT` in the spider's `custom_settings`.

For example:

import scrapy

class SavingsGov(scrapy.Spider):
    name        = 'savings'
    start_urls  = [
        'https://savings.gov.pk/download-draws/'
    ]
    custom_settings = {
        "USER_AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
    }

    def parse(self, response):
        for option in response.css('select option'):
            yield {
                'url': option.css('::attr(value)').get()
            }
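Equivalently, the override can live in the project's `settings.py`, where it applies to every spider (the UA string below is just an example browser string):

```python
# settings.py -- project-wide alternative to per-spider custom_settings
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/110.0.0.0 Safari/537.36"
)
```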

Partial output:

2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draw-list/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-200-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-1500-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-15000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-premium-bonds-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-40000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-25000-draws/'}
2023-08-26 21:11:24 [scrapy.core.scraper] DEBUG: Scraped from <200 https://savings.gov.pk/download-draws/>
{'url': 'http://savings.gov.pk/rs-7500-draws/'}