How to bypass a captcha using Python

Problem description

I know there are many other questions about this, but they seem to be outdated, or at least their solutions no longer work. I have tried several approaches, such as proxy rotators, custom proxy lists (which I would ideally like to avoid), and driving a Tor session from Python, but none of them got me anything other than this error:

--more--

 <p class="text-center text-red-800">
                                As you were using this website, something about your browser or behaviour made us think you might be a bot.<br/>Solve the captcha below to continue browsing the site.
                        </p>

--more--

Basically, I am using Python to scrape a website that lists properties (rooms, apartments, and so on). The script makes a lot of requests (that is its whole point), and now I am getting the captcha response quoted above.

The relevant part of my code, where the session is initialized and the requests are made, is the following:

import requests
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
import random


def init(url):
    global session
    global proxies
    session = requests.Session()
    proxies = [
        'http://35.185.196.38:3128',
        'https://35.185.196.38:3128',
        'http://202.86.138.18:8080',
        'https://202.86.138.18:8080',
        'https://20.206.106.192:80',
        'https://20.210.113.32:80',
        'https://20.206.106.192:8123',
        'https://89.43.31.134:3128',
        'https://88.198.212.91:3128',
        'http://213.217.30.69:3128',
        'https://213.217.30.69:3128',
        'https://204.109.59.194:3121',
        'https://20.111.54.16:8123',
        'https://195.154.184.80:8080',

    ]

    proxy = random.choice(proxies)
    print(f"Using proxy: {proxy}")


    
    # fake_useragent exposes .random for a randomized UA string;
    # str(UserAgent()) would send the object's repr, not a usable UA
    user_agent = UserAgent()
    session.headers.update({'User-Agent': user_agent.random})

    # Note: these two warm-up requests are sent directly (no proxy),
    # while the final request below goes through the proxy, so the
    # site sees the same session arriving from two different IPs.
    response = session.get('https://[website_url]/')
    assert response.status_code == 200

    response = session.get('https://[website_url]/cgi-bin/fl/js/verify')
    assert response.status_code == 200

    try:
        response = session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        return response
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None
    

def scrape_website(url):
    response = session.get(url)
    if response.status_code == 200:
        print(response.text) # Here is where I print the response which contains the captcha response.
        # -- rest of the code --
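In `init()` above, the warm-up requests go out directly while only the final request uses the proxy, so the site sees two different IPs for the same session. One way to avoid that is to pin the chosen proxy on the session itself, so every request exits through the same address. A minimal sketch, reusing a subset of the proxy list from the question:

```python
import random
import requests

# Subset of the proxy list from the question; any of them would do here.
PROXIES = [
    'http://35.185.196.38:3128',
    'http://202.86.138.18:8080',
    'http://213.217.30.69:3128',
]

def make_session() -> requests.Session:
    """Build a session whose every request uses one randomly chosen proxy."""
    session = requests.Session()
    proxy = random.choice(PROXIES)
    # session.proxies applies to all requests made through this session,
    # so the warm-up requests and the scrape itself share one exit IP.
    session.proxies = {'http': proxy, 'https': proxy}
    return session
```

This also avoids the `global session` / `global proxies` pattern, since the configured session is simply returned to the caller.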

python web-scraping session python-requests captcha
1 Answer

Could you try adding the headers from the website itself instead of from fake_useragent? Add all of the headers shown for the URL in the browser's Network tab.

For example:

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
}
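Such browser-copied headers can be attached to the session once, so every request carries them. A minimal sketch with the values from this answer (`Accept-Encoding` is deliberately left out, since requests advertises only the codings it can actually decode):

```python
import requests

# Headers copied from a real browser request (DevTools -> Network tab).
BROWSER_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,'
              'image/avif,image/webp,image/apng,*/*;q=0.8,'
              'application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/124.0.0.0 Safari/537.36',
}

session = requests.Session()
# update() merges on top of requests' own defaults, so every request
# made through this session now sends the browser-like headers.
session.headers.update(BROWSER_HEADERS)
```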

© www.soinside.com 2019 - 2024. All rights reserved.