I am getting this error:
[scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.bigbasket.com> (referer: None)
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.bigbasket.com>: HTTP status code is not handled or not allowed
To fix this, I searched online and found some suggested solutions, but none of them worked. For example, I tried
scrapy-user-agents
and pasted this code into settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}
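For comparison, here is a minimal sketch of the same setting using the module path that Scrapy has used since version 1.0, where `scrapy.contrib.downloadermiddleware` was renamed to `scrapy.downloadermiddlewares`. It assumes a Scrapy release of 1.0 or later and that the scrapy-user-agents package is installed. Whether this change alone gets past the 403 is not established here.

```python
# settings.py (sketch; assumes Scrapy >= 1.0 and scrapy-user-agents are installed)
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's built-in user-agent middleware (post-1.0 module path).
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    # Enable the random user-agent middleware provided by scrapy-user-agents.
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}
```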
I also tried this:
pip install scrapy-random-useragent
I want to scrape all the products from this website. Could someone please help me solve this problem? Here is my code:
from urllib.parse import urljoin
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime
import pandas as pd


class GrocerySpider(scrapy.Spider):
    name = "bigbasket"

    def start_requests(self):
        url = "https://www.bigbasket.com"
        yield scrapy.Request(url, callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        url = response.url
        print("wow")


if __name__ == '__main__':
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(GrocerySpider)
    process.start()