Scrapy reports "HTTP status code is not handled or not allowed" when scraping website products with Python


I am getting this error:

[scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.bigbasket.com> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.bigbasket.com>: HTTP status code is not handled or not allowed
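For context, the INFO line above comes from Scrapy's HttpError spider middleware, which by default passes only 2xx responses to spider callbacks. The sketch below is a simplified model of that rule, not Scrapy's actual code; `reaches_callback` is a hypothetical name used only for illustration. In Scrapy itself, the extra allowed codes come from the spider attribute `handle_httpstatus_list` or the `HTTPERROR_ALLOWED_CODES` setting.

```python
def reaches_callback(status, allowed_codes=()):
    """Simplified model: 2xx responses pass; others only if explicitly allowed."""
    return 200 <= status < 300 or status in allowed_codes


print(reaches_callback(200))         # True
print(reaches_callback(403))         # False: dropped, as in the log above
print(reaches_callback(403, [403]))  # True: the callback would see the 403
```

Allowing 403 only lets the callback inspect the blocked response; it does not make the site serve the page.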

To solve this, I searched the internet and found some solutions, but they did not work. For example, I have already tried

scrapy-user-agents
and pasted this code into settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
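One thing worth checking in the snippet above: the `scrapy.contrib` package was deprecated in Scrapy 1.0 and later removed, so on a current Scrapy release that first key probably does not disable the built-in user-agent middleware. A settings.py fragment using the current module path might look like the following; whether it gets past the 403 is a separate question, since the site may block on more than the User-Agent header.

```python
# settings.py (fragment)
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in User-Agent middleware (current module path)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Enable the random User-Agent middleware from scrapy-user-agents
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
```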

I also tried this:

pip install scrapy-random-useragent

I want to scrape all the products from this website. Could someone please help me solve this problem? Here is my code.

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class GrocerySpider(scrapy.Spider):
    name = "bigbasket"

    def start_requests(self):
        url = "https://www.bigbasket.com"
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        # Debug marker: never printed while the 403 response is being filtered out
        url = response.url
        print("wow")


if __name__ == '__main__':
    # Loads the project's settings.py when run from inside the Scrapy project
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(GrocerySpider)
    process.start()  # blocks until the crawl finishes
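A possible factor with the script above: `get_project_settings()` finds settings.py through the project's scrapy.cfg, so if the script is started from outside the project directory, the middleware settings may never be applied. One way to rule that out is to put the settings on the spider itself via its `custom_settings` class attribute. The dict below is a minimal sketch with assumed values: the User-Agent string and header values are illustrative, `CUSTOM_SETTINGS` is a hypothetical name, and none of this is confirmed to get past the 403 on this particular site.

```python
# Hypothetical values; assign this dict to GrocerySpider.custom_settings
CUSTOM_SETTINGS = {
    # Browser-like User-Agent (illustrative string, an assumption)
    "USER_AGENT": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    # Headers a browser would normally send (illustrative values)
    "DEFAULT_REQUEST_HEADERS": {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    },
    # Let 403 responses reach the callback so the response body can be inspected
    "HTTPERROR_ALLOWED_CODES": [403],
}
```

With `custom_settings = CUSTOM_SETTINGS` inside the spider class, these values override the project settings for this spider only.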
python proxy scrapy user-agent