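I know there are many other questions about this problem, but they seem... outdated(?), or at least their answers no longer work. I have tried several approaches, such as proxy rotators, custom proxy lists (which I would ideally like to avoid), and Tor sessions from Python, but none of them gave me anything other than this error: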
<p class="text-center text-red-800">
As you were using this website, something about your browser or behaviour made us think you might be a bot.<br/>Solve the captcha below to continue browsing the site.
</p>
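Basically, I am using Python to scrape a website that lists properties such as rooms and apartments. However, I make a lot of requests (that is the whole point of the script), and now I get the captcha response shown above.
The important part of my code, which initializes the session and makes the requests, is as follows: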
import requests
from bs4 import BeautifulSoup
import csv
from fake_useragent import UserAgent
import random

def init(url):
    global session
    global proxies
    session = requests.Session()
    proxies = [
        'http://35.185.196.38:3128',
        'https://35.185.196.38:3128',
        'http://202.86.138.18:8080',
        'https://202.86.138.18:8080',
        'https://20.206.106.192:80',
        'https://20.210.113.32:80',
        'https://20.206.106.192:8123',
        'https://89.43.31.134:3128',
        'https://88.198.212.91:3128',
        'http://213.217.30.69:3128',
        'https://213.217.30.69:3128',
        'https://204.109.59.194:3121',
        'https://20.111.54.16:8123',
        'https://195.154.184.80:8080',
    ]
    proxy = random.choice(proxies)
    print(f"Using proxy: {proxy}")
    user_agent = UserAgent()
    # .random picks an actual browser User-Agent string;
    # str(user_agent) would stringify the UserAgent object itself.
    session.headers.update({'User-Agent': user_agent.random})
    # Note: these two warm-up requests are sent without the proxy.
    response = session.get('https://[website_url]/')
    assert response.status_code == 200
    response = session.get('https://[website_url]/cgi-bin/fl/js/verify')
    assert response.status_code == 200
    try:
        # Only this request is routed through the chosen proxy.
        response = session.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
        return response
    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return None

def scrape_website(url):
    # Uses the global session created in init(); no proxy is passed here.
    response = session.get(url)
    if response.status_code == 200:
        print(response.text)  # This is where the captcha page gets printed.
        # -- rest of the code --
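Since the captcha page comes back with status 200, checking `status_code` alone will not catch it. A minimal sketch of how the scraper could recognize that page and slow down, assuming the notice text quoted above stays the same; the names `looks_like_captcha` and `backoff_delay` and the delay values are hypothetical, chosen only for illustration:

```python
import random

# Marker text copied from the captcha response quoted in the question.
CAPTCHA_MARKER = "made us think you might be a bot"

def looks_like_captcha(html: str) -> bool:
    """Return True if the page body contains the captcha notice."""
    return CAPTCHA_MARKER in html

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with jitter.

    The upper bound doubles with each attempt (base * 2**attempt) up to `cap`;
    the returned delay is drawn uniformly from the upper half of that range.
    The default values are assumptions, not limits published by the site.
    """
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(upper / 2, upper)
```

In `scrape_website`, this could be used as `if looks_like_captcha(response.text): time.sleep(backoff_delay(attempt))` before retrying, rather than parsing the captcha HTML as if it were a listing page.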
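Could you try adding headers taken from the website itself instead of using fake_useragent? Add all the request headers shown for the URL in the browser's Network tab.
For example: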
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Accept-Language': 'en-US,en;q=0.9',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
}
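A minimal sketch of how such headers could be attached to a `requests.Session` so that every request sends them, not just one. The helper name `make_session` is hypothetical, and the header values are only an example; they should be copied from your own browser. One assumption to note: this sketch advertises only `gzip, deflate`, because `requests` decodes `br` and `zstd` bodies only when the optional decoder packages are installed.

```python
import requests

# Example values; replace them with the headers your own browser sends.
BROWSER_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
    ),
}

def make_session(headers, proxy=None):
    """Create a session that sends `headers` (and optionally uses `proxy`) on every request."""
    session = requests.Session()
    session.headers.update(headers)
    if proxy:
        # Setting proxies on the session applies them to all later requests.
        session.proxies.update({"http": proxy, "https": proxy})
    return session
```

With this, `session = make_session(BROWSER_HEADERS)` followed by `session.get(url, timeout=10)` sends the same browser-like headers on the warm-up requests and on the scraping requests alike.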