I am trying to fetch the content of a few websites with Scrapy, but they all return a 403 (Forbidden) response code. The same websites work fine when I make the request with requests.get(), as shown below:
import requests

url = "https://www.name_of_website.com/"
headers = {
    # A browser-like User-Agent; without it many sites reject the request
    "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
}
response = requests.get(url, headers=headers)
print(response.status_code)  # 200 for the same sites that give Scrapy a 403
Also, the websites open as usual in Chrome. I tried using the same headers Chrome sends, via DEFAULT_REQUEST_HEADERS in Scrapy, but it still fails.
I have no idea why Scrapy fails while a plain requests.get() works; I have observed this behavior on many websites. I also tried scrapy-fake-useragent together with its middleware, but with no success.
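For context, the scrapy-fake-useragent middleware configuration I tried was along these lines (a settings.py fragment based on that library's README):

```python
# settings.py fragment: disable the built-in User-Agent/Retry middlewares
# and let scrapy-fake-useragent rotate User-Agents instead
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'scrapy_fake_useragent.middleware.RetryUserAgentMiddleware': 401,
}
```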
Any clues or solutions would be greatly appreciated.
I saw a similar question here, but it did not help, so I am asking the experts in this area for fresh ideas.
Thanks.
Edit (answering @ewoks and @Lakshmanrao Simhadri):
I am trying the following URLs for research purposes, and as mentioned, these are the response codes I receive:
https://www.fastcompany.com/ - 403
https://www.ft.com/ - 200
https://www.theinformation.com/ - 200
https://www.pcmag.com/ - 403
https://www.thestreet.com/ - 403
None of these requests used Scrapy.
My Scrapy code is as simple as the following:

import scrapy

class TheinformationSpider(scrapy.Spider):
    name = "theinformation"
    allowed_domains = ["www.theinformation.com"]
    start_urls = ["https://www.theinformation.com/"]

    def parse(self, response):
        print(response)

For now I am only looking at the response code.
My updated settings are as follows:
DEFAULT_REQUEST_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Linux; x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "http://www.google.com",
}
I get the following response while crawling:
2024-03-08 15:15:54 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.theinformation.com/> (referer: http://www.google.com)
2024-03-08 15:15:54 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.theinformation.com/>: HTTP status code is not handled or not allowed
2024-03-08 15:15:54 [scrapy.core.engine] INFO: Closing spider (finished)
Total articles scrapped by "theinformation" = 0, null data = 0
I tried making the Scrapy request with the exact headers copied from Chrome, but it still failed. By using a proxy I am able to get a response. Please take a look at the solution below and let me know your thoughts:
from urllib.parse import urlencode

import scrapy

# Get your own api_key from scrapeops or some other proxy vendor
API_KEY = "api_key"

def get_scrapeops_url(url):
    payload = {'api_key': API_KEY, 'url': url}
    proxy_url = 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)
    return proxy_url

class FastCompany(scrapy.Spider):
    name = "fastcompany"

    def start_requests(self):
        urls = ["https://www.fastcompany.com/"]
        for url in urls:
            # Route the request through the proxy endpoint
            proxy_url = get_scrapeops_url(url)
            yield scrapy.Request(url=proxy_url, callback=self.parse)

    def parse(self, response):
        print(response)
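As a sanity check on the wrapper, the target URL is just percent-encoded into the proxy endpoint's query string and round-trips cleanly (stdlib only, no network needed; same helper as above with the placeholder key):

```python
from urllib.parse import urlencode, urlparse, parse_qs

API_KEY = "api_key"  # placeholder, as above

def get_scrapeops_url(url):
    # Same helper as in the spider: wrap the target URL in the proxy query string
    payload = {'api_key': API_KEY, 'url': url}
    return 'https://proxy.scrapeops.io/v1/?' + urlencode(payload)

proxied = get_scrapeops_url("https://www.fastcompany.com/")
params = parse_qs(urlparse(proxied).query)
print(params['url'][0])  # → https://www.fastcompany.com/
```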