How do I efficiently implement multithreading in a Python web scraper?

Question · Votes: 0 · Answers: 1

Hello Stack Overflow community, I'm currently working on a project that involves web scraping with Python and BeautifulSoup. The code I have works for smaller websites, but it doesn't hold up on larger sites with thousands of pages, leading to long processing times. Here is a simplified version of the single-threaded code I'm currently using:

import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... (scraping code here) ...

urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
for url in urls:
    scrape_website(url)

I attempted to run the scraper sequentially on multiple URLs using a for loop, hoping for a quick process, but it was still too slow for larger sites.
python multithreading performance web-scraping beautifulsoup
1 Answer

0 votes

To speed up the process, you can use multithreading or multiprocessing to scrape multiple URLs concurrently. Scraping is mostly I/O-bound (the program spends its time waiting on network responses), so threads give a real speedup despite Python's GIL.

import requests
from bs4 import BeautifulSoup
import threading

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... (scraping code here) ...

def scrape_multiple_websites(urls):
    threads = []
    for url in urls:
        thread = threading.Thread(target=scrape_website, args=(url,))
        threads.append(thread)
        thread.start()

    # Wait for all threads to finish
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
    scrape_multiple_websites(urls)