I want to get the status codes of 2000 URLs, storing each status code as a dictionary key and the URLs themselves as the values. I also want to do this as fast as possible. I have read about async and ThreadPoolExecutor, but I don't yet know how to use them. How can I solve this efficiently?
Here is what I have tried:
import requests

def check_urls(list_of_urls):
    result = {"200": [], "404": [], "anything_else": []}
    for url in list_of_urls:
        try:
            response = requests.get(url)
            if response.status_code == 200:
                result["200"].append(url)
            elif response.status_code == 404:
                result["404"].append(url)
            else:
                result["anything_else"].append((url, f"HTTP Error {response.status_code}"))
        except requests.exceptions.RequestException as e:
            result["anything_else"].append((url, e))
    return result
Is there any way to improve this code so it handles 2000 URLs faster? I have already tried requests.head, but it is not accurate.
Assuming you have all your URLs stored in a list:
URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://nonexistant-subdomain.python.org/']
Then you can use either of these two solutions:
Solution 1 - multithreading (ThreadPoolExecutor)
You can use the concurrent.futures library for multithreaded execution. I also recommend checking the library's documentation - it has a very neat example that is very close to your case (https://docs.python.org/3/library/concurrent.futures.html).
import concurrent.futures
from multiprocessing import cpu_count
import requests

def load_url(url):
    # Retrieve a single page and return its status code
    try:
        response = requests.get(url)
        return response.status_code
    except requests.exceptions.RequestException:
        # connection failures are reported as 404 here
        return 404

n_threads = cpu_count()
print(f"Count of threads available - {n_threads}")

# use a 'with' statement to ensure threads are cleaned up promptly after finishing their jobs
with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    # start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        status_code = future.result()
        print(url, status_code)
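Since you asked for the results keyed by status code, here is a minimal sketch of how the same ThreadPoolExecutor pattern could collect everything into a dictionary instead of printing it. It assumes the URLS list and load_url function from above; the max_workers value of 20 is an arbitrary choice of mine, since threads waiting on network I/O are not bound by CPU cores:

import concurrent.futures
from collections import defaultdict

# Sketch: group URLs by the status code that load_url (defined above) returns
result = defaultdict(list)
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    future_to_url = {executor.submit(load_url, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        result[future.result()].append(future_to_url[future])

print(dict(result))  # e.g. {200: [...], 404: [...]}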
Solution 2 - asyncio
Unfortunately the requests library does not support async calls, so you have to improvise - for example with aiohttp, as below - or install and use grequests:
import asyncio
import aiohttp

async def async_aiohttp_get_all(urls, cookies):
    async with aiohttp.ClientSession(cookies=cookies) as session:

        async def fetch(url):
            try:
                async with session.get(url) as response:
                    return response.status
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # connection failures are reported as 404 here as well
                return 404

        return await asyncio.gather(*[fetch(url) for url in urls])

results = asyncio.run(async_aiohttp_get_all(URLS, None))
for url, status_code in zip(URLS, results):
    print(url, status_code)
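With 2000 URLs you will probably also want to cap how many requests are in flight at once and set a timeout. Below is a minimal sketch of one way to do that with aiohttp, which also builds the dictionary you described; the connection limit of 100, the 10-second timeout, and the check_all name are my own assumptions, not requirements:

import asyncio
import aiohttp

async def check_all(urls, limit=100, timeout_s=10):
    # Sketch: fetch every URL's status code with at most `limit` open connections
    connector = aiohttp.TCPConnector(limit=limit)
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    async with aiohttp.ClientSession(connector=connector, timeout=timeout) as session:

        async def fetch(url):
            try:
                async with session.get(url) as response:
                    return response.status
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                return str(e)  # keep the error message instead of pretending it was a 404

        return await asyncio.gather(*(fetch(url) for url in urls))

statuses = asyncio.run(check_all(URLS))
result = {}
for url, status in zip(URLS, statuses):
    result.setdefault(status, []).append(url)
print(result)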