Python Selenium web scraper gets past Cloudflare when visiting the page, but not when scraping elements

Question

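I built a web scraper using Python and Selenium. It reads URLs from a CSV file, navigates to each URL, and writes elements to another CSV file. It has to scrape data from a site that uses Cloudflare for bot protection. When I first tried using Selenium through VBA or Python, I could not access the page at all. undetected_chrome did not work either. I read somewhere that you can pass (uc=True) to make your browser undetectable, so I did that.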

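This approach at least lets me load the page, which is further than I got before (it used to block my Selenium browser completely). However, when I open the CSV that was written, the value is listed as "Attention Required! | Cloudflare".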

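Why is it able to fool Cloudflare when I open the page, but not when I try to scrape? Can I fix it somehow, for example by adding a timer before scraping the elements, or by adding some random mouse clicks? My code is shown below.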

    import csv
    import requests
    from bs4 import BeautifulSoup
    from seleniumbase import Driver
    import time

    driver = Driver(uc=True)

    # Open the CSV file with URLs in the first column and create an output CSV file
    input_csv_path = r'myinputfilename.csv'
    output_csv_path = r'myouputfilename.csv'
    with open(input_csv_path, 'r') as input_file, open(output_csv_path, 'w', newline='') as output_file:
        csv_reader = csv.reader(input_file)
        csv_writer = csv.writer(output_file)

        for row in csv_reader:
            link = row[0]  # The link is in the first column (index 0) of each row
            driver.get(link)

            response = requests.get(link)
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the title
            title = soup.title.string if soup.title else 'N/A'

            # Write the URL and title to the output CSV
            csv_writer.writerow([link, title])
            time.sleep(6)

    driver.quit()

Answer 1

Try this "undetected" Chrome driver:

https://github.com/ultrafunkamsterdam/undetected-chromedriver

I haven't used it in a while, but it seems to be a well-maintained project, so I bet it still goes undetected.
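
A further point that may be relevant, offered as a likely explanation rather than a confirmed diagnosis: in the question's code the title is not read from the Selenium browser at all. `driver.get(link)` loads the page in the UC-mode browser, but the HTML that gets parsed comes from a separate `requests.get(link)` call, which is a plain HTTP client that shares none of the browser's cookies or JavaScript execution, so Cloudflare may well serve it the block page. The sketch below reads the HTML from `driver.page_source` instead. The names `title_from_html` and `scrape_titles` are hypothetical helpers made up for illustration, and the standard-library parser is used only to keep the example self-contained.

```python
import time
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def title_from_html(html):
    """Return the page title, or 'N/A' when there is none (hypothetical helper)."""
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or "N/A"


def scrape_titles(driver, urls, delay=0.0):
    """Yield (url, title) pairs using only the browser session.

    `driver` is assumed to be a Selenium-style driver exposing get() and
    page_source, such as the one returned by seleniumbase's Driver(uc=True).
    """
    for url in urls:
        driver.get(url)
        if delay:
            # Optional pause so the page (or a challenge) can finish loading.
            time.sleep(delay)
        # Parse what the browser actually rendered; no second HTTP request.
        yield url, title_from_html(driver.page_source)
```

With the question's setup this could be called as `scrape_titles(driver, urls, delay=6)` after `driver = Driver(uc=True)`, writing each pair with `csv_writer.writerow`. Keeping BeautifulSoup should work the same way by passing `driver.page_source` to it in place of `response.text`.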
