How to bypass the Cloudflare check in scrapy-selenium?


I am trying to scrape professional phone numbers from a French website, but I get a 403 error and am blocked by Cloudflare. I am using Selenium with Scrapy. I added the scrapy-cloudflare middleware, but it still doesn't work. I also added some option arguments to the Selenium options.

spider.py:

import scrapy
import random
from scrapy_selenium import SeleniumRequest
from scrapy.selector import Selector
from selenium import webdriver


class ApiPbSpider(scrapy.Spider):
    name = 'api_pb'

    def start_requests(self):
        yield SeleniumRequest(
            url = 'https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=sylvie&ou=Saint+Beno%C3%AEt+%2886280%29&univers=pagesblanches&idOu=L08621400',
            callback=self.parse,
            wait_time = 15,
         )
    
    def parse(self, response):
        # scrapy-selenium exposes the Selenium webdriver on response.meta
        driver = response.meta['driver']
        code_page = driver.page_source
        print(code_page)
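Note that wait_time on its own is only an upper bound on an implicit wait; pairing it with wait_until makes the request block until the page has actually rendered, instead of handing parse() the Cloudflare interstitial. A minimal sketch for start_requests (the CSS selector is hypothetical; adjust it to the real results markup):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url='https://www.pagesjaunes.fr/pagesblanches/recherche?quoiqui=sylvie&ou=Saint+Beno%C3%AEt+%2886280%29&univers=pagesblanches&idOu=L08621400',
    callback=self.parse,
    wait_time=15,
    # Do not fire the callback until the (hypothetical) results
    # list exists in the DOM.
    wait_until=EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.bi-list')),
)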

settings.py:

from shutil import which  # needed by the chromedriver lookup below

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless']  # for Firefox the flag is '-headless'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36'


# Obey robots.txt rules
ROBOTSTXT_OBEY = False


# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = True

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language':'fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7',
    'Origin': 'https://www.pagesjaunes.fr/pagesblanches/',
    'Referer':'https://www.pagesjaunes.fr/pagesblanches/',
    'Sec-Ch-Ua':'"Google Chrome";v="113", "Chromium";v="113", "Not-A.Brand";v="24"',
    'Sec-Ch-Ua-Mobile':'?0',
    'Sec-Ch-Ua-Platform':'"Windows"',
    'Sec-Fetch-Dest':'empty',
    'Sec-Fetch-Mode': 'cors',
    'Sec-Fetch-Site': 'same-site',
}
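One thing to keep in mind: pages fetched through SeleniumMiddleware are loaded by the real browser, so USER_AGENT and DEFAULT_REQUEST_HEADERS above never reach the target site for those requests. If the user agent matters, pass it to Chrome itself, e.g. in settings.py (a sketch reusing the UA string from above):

SELENIUM_DRIVER_ARGUMENTS = [
    '--headless',
    # the browser sends this UA, not Scrapy's USER_AGENT setting
    '--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
]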

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'api_pages_blanches.middlewares.ApiPagesBlanchesSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # The priority of 560 is important, because we want this middleware to kick in just before the scrapy built-in `RetryMiddleware`.
    'scrapy_cloudflare_middleware.middlewares.CloudFlareMiddleware': 560,
    'scrapy_selenium.SeleniumMiddleware': 800
}

However, if I add a residential proxy I get a 200 status code, but the body that comes back is empty. Do you have any ideas?

python selenium-webdriver web-scraping scrapy scrapy-selenium
1 Answer

0 votes

I have been blocked by some websites before because I was running Selenium in headless mode. First, try turning headless mode off. If that works, it means the site needs JavaScript to load, and you can carry on scraping with a visible browser.

Just remove the line

SELENIUM_DRIVER_ARGUMENTS = ['--headless']

from settings.py.
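If you prefer to keep an explicit argument list, here is a minimal settings.py sketch with headless mode off; the extra flags are assumptions on my part (commonly used to look less automated), not part of the original setup:

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = [
    # no '--headless': headless Chrome is easy for Cloudflare to fingerprint
    '--disable-blink-features=AutomationControlled',  # suppresses the navigator.webdriver hint
    '--window-size=1920,1080',
]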
