Hello Stack Overflow community, I'm currently working on a project that involves web scraping with Python and BeautifulSoup. The code I have works fine for smaller websites, but it doesn't scale to larger sites with thousands of pages, resulting in long processing times. Here is a simplified version of the single-threaded code I'm currently using:
import requests
from bs4 import BeautifulSoup

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... (scraping code here) ...

urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
for url in urls:
    scrape_website(url)

I attempted to run the scraper sequentially over multiple URLs with a for loop, hoping it would finish quickly, but it takes far too long on larger sites. How can I speed this up?
To speed up the process, you can use multithreading or multiprocessing to scrape multiple URLs concurrently. Here is your code adapted to use the standard library's threading module:
import requests
from bs4 import BeautifulSoup
import threading

def scrape_website(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... (scraping code here) ...

def scrape_multiple_websites(urls):
    threads = []
    for url in urls:
        # Start one thread per URL so the HTTP requests overlap
        thread = threading.Thread(target=scrape_website, args=(url,))
        threads.append(thread)
        thread.start()
    # Wait for all threads to finish
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
    scrape_multiple_websites(urls)
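
One caveat: this starts one thread per URL, which won't scale to the thousands of pages you mention. A more robust pattern is to bound the concurrency with concurrent.futures.ThreadPoolExecutor from the standard library. Here is a minimal sketch; the max_workers value and the request timeout are assumptions you should tune for your target sites:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_website(url):
    # The timeout is an assumed safeguard so one slow server can't stall a worker
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... (scraping code here) ...

def scrape_multiple_websites(urls, max_workers=20):
    # A bounded pool reuses a fixed number of threads instead of
    # spawning one thread per URL
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape_website, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
            except Exception as exc:
                # One failed page shouldn't abort the whole crawl
                print(f"{url} failed: {exc}")

if __name__ == "__main__":
    urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
    scrape_multiple_websites(urls)

Since scraping is I/O-bound, threads (or asyncio) are usually the better fit; multiprocessing only pays off if the per-page parsing itself is CPU-heavy. If that is your case, here is the same idea sketched with a process pool (processes=4 is an assumed value; match it to your core count):

from multiprocessing import Pool

if __name__ == "__main__":
    urls = ['http://example1.com', 'http://example2.com', 'http://example3.com']
    # Each URL is handled in a separate worker process
    with Pool(processes=4) as pool:
        pool.map(scrape_website, urls)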