How to scrape this website using Selenium


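I want to scrape the website https://www.rome2rio.com. Below is the code I came up with. Unfortunately, I get a CAPTCHA on 99% of my attempts. Can anyone suggest what I could add to the code, or how to modify it, to improve this and avoid being detected?

Thanks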

from selenium import webdriver
import undetected_chromedriver as uc
import time
import random

# Initialize undetected ChromeOptions
chrome_options = uc.ChromeOptions()

# Essential options to avoid detection
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--incognito")

# Correctly setting excludeSwitches within undetected_chromedriver context
chrome_options.add_argument("--disable-blink-features=AutomationControlled")
chrome_options.add_argument("--start-maximized")  # To start maximized
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_experimental_option('useAutomationExtension', False)

# Rotating User-Agent
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    # Add more as needed
]
random_user_agent = random.choice(user_agents)
chrome_options.add_argument(f"user-agent={random_user_agent}")

# Adjusting viewport size to non-standard dimensions if needed
# chrome_options.add_argument("--window-size=1366,768")  # Use only if you don't want to start maximized

# Use undetected_chromedriver to avoid detection
driver = uc.Chrome(options=chrome_options)

# Open the specified website
driver.get("https://www.rome2rio.com/map/Marseille/Paris")

# Mimicking human behavior with random sleep
time.sleep(random.uniform(2, 4))

# Proceed with your script...

# Close the driver after operations are complete
driver.quit()

Tags: selenium-webdriver, captcha

2 Answers

0 votes

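I believe that solving the CAPTCHA through the API of 2Captcha, or another CAPTCHA-solving service, would be a more reliable solution than trying to evade detection. These services may not be free, but for most applications the pricing is not a problem: $1-2 per 1,000 requests, depending on the CAPTCHA type.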
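
A minimal sketch of the submit-then-poll flow such services typically use, assuming the page shows a reCAPTCHA v2 challenge (the question does not say which CAPTCHA type rome2rio serves) and assuming 2Captcha's classic `in.php` / `res.php` HTTP endpoints. The names `solve_recaptcha` and `http_get_json` are hypothetical helpers, and the site key has to be read from the page yourself. Check the service's current documentation before relying on this.

```python
import json
import time
import urllib.parse
import urllib.request


def http_get_json(url, params):
    """GET `url` with query parameters and decode the JSON response body."""
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen(f"{url}?{query}", timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))


def solve_recaptcha(api_key, site_key, page_url,
                    fetch=http_get_json, sleep=time.sleep,
                    poll_interval=5, max_polls=24):
    """Submit a reCAPTCHA v2 task and poll until a token is returned.

    Assumption: endpoints and parameter names follow 2Captcha's classic
    HTTP API; `fetch` and `sleep` are injectable so the flow can be tested
    without network access.
    """
    submitted = fetch("https://2captcha.com/in.php", {
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
        "json": 1,
    })
    if submitted.get("status") != 1:
        raise RuntimeError(f"submit failed: {submitted.get('request')}")
    task_id = submitted["request"]

    for _ in range(max_polls):
        sleep(poll_interval)  # give the service time to work on the task
        result = fetch("https://2captcha.com/res.php", {
            "key": api_key,
            "action": "get",
            "id": task_id,
            "json": 1,
        })
        if result.get("status") == 1:
            return result["request"]  # the solved token
        if result.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"solve failed: {result.get('request')}")
    raise TimeoutError("captcha was not solved in time")
```

The returned token still has to be placed into the page (for reCAPTCHA v2, usually the hidden `g-recaptcha-response` field) before submitting the form. That step depends on the site and is not shown here.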


0 votes

You can use SeleniumBase (https://github.com/seleniumbase/SeleniumBase) UC Mode to avoid the CAPTCHA.

pip install seleniumbase
After that, you can run the following with python:

from seleniumbase import Driver

driver = Driver(uc=True)
driver.uc_open_with_reconnect("https://www.rome2rio.com/map/Marseille/Paris", 3)
driver.type('input[aria-label="From"]', "Geneva, Switzerland")
driver.type('input[aria-label="To"]', "Vienna, Austria")
driver.click('button span:contains("Search")')

breakpoint()

driver.quit()

The script pauses at breakpoint(). Type c in the console and press Enter to continue from the breakpoint.

More documentation on UC Mode: SeleniumBase/help_docs/uc_mode.md

The SeleniumBase driver contains all of the original driver methods, as well as new ones.
