Quickly get the status codes of 2000 URLs and store them as a dictionary


I want to get the status codes of 2000 URLs. I want to store each status code as a dictionary key, with the URLs themselves as the values. I also want to do this as fast as possible. I have read about async and ThreadPoolExecutor, but I don't yet know how to use them. How can I solve this efficiently?

Here is what I tried:

import requests 


def check_urls(list_of_urls):
    
    result = {"200": [], "404": [], "anything_else": []}
    
    for url in list_of_urls:
        try:
            response = requests.get(url)
            if response.status_code == 200:
                result["200"].append(url)
            elif response.status_code == 404:
                result["404"].append(url)
            else:
                result["anything_else"].append((url, f"HTTP Error {response.status_code}"))
        except requests.exceptions.RequestException as e:
            result["anything_else"] = ((url, e))
    
    return result 

Is there any way to improve this code so it processes 2000 URLs faster? I have already tried requests.head, but it was not accurate.
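
For reference, one common reason requests.head looks inaccurate is that some servers answer HEAD with 405/501 or with a different code than they would return for GET. A minimal sketch of a HEAD check that falls back to GET in those cases (the fallback logic and timeout value are assumptions, not part of my original attempt):

import requests

def quick_status(url):
    # Try the cheap HEAD request first; fall back to GET for
    # servers that do not implement HEAD properly
    try:
        response = requests.head(url, allow_redirects=True, timeout=10)
        if response.status_code in (405, 501):
            response = requests.get(url, timeout=10)
        return response.status_code
    except requests.exceptions.RequestException:
        return None  # network-level failure, no HTTP status available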

python asynchronous python-requests threadpoolexecutor
1 Answer

Assuming you have all your URLs stored in a list:

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://nonexistant-subdomain.python.org/']

Then you can use either of these two solutions:

Solution 1 - Multithreading

You can use the concurrent.futures library for multithreaded execution. I also recommend checking the library documentation - it has a very neat example that is very close to your case (https://docs.python.org/3/library/concurrent.futures.html).

import concurrent.futures
from multiprocessing import cpu_count
import requests

def load_url(url):
    # Retrieve a single page and return its status code
    try:
        response = requests.get(url)
        return response.status_code
    except requests.exceptions.RequestException:
        # Treat network-level failures (DNS errors, timeouts, ...) as 404
        return 404


n_threads = cpu_count()
print(f"Number of CPU cores available - {n_threads}")

# Use a 'with' statement to ensure threads are cleaned up promptly after finishing their jobs
with concurrent.futures.ThreadPoolExecutor(max_workers=n_threads) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        status_code = future.result()
        print(url, status_code)
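
The loop above prints each result; if you want the dictionary shape from the question (status code as key, list of URLs as value), you can collect the futures into a defaultdict instead. A minimal sketch reusing load_url from above (the helper name and the max_workers value are placeholders; for I/O-bound work like HTTP requests, the thread count can usefully be much larger than cpu_count()):

import collections
import concurrent.futures

def check_urls_threaded(urls, max_workers=32):
    # Group URLs by the status code returned by load_url() above
    result = collections.defaultdict(list)
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_url = {executor.submit(load_url, url): url for url in urls}
        for future in concurrent.futures.as_completed(future_to_url):
            result[future.result()].append(future_to_url[future])
    return dict(result)

print(check_urls_threaded(URLS))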

Solution 2 - Async

Unfortunately, the requests library does not support async calls, so you need to improvise: install and use grequests, or switch to aiohttp as shown below.
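
A minimal grequests sketch, assuming it is installed (pip install grequests); grequests.map() runs the requests concurrently on gevent greenlets and returns None for requests that failed:

import grequests

reqs = (grequests.get(url) for url in URLS)
responses = grequests.map(reqs)
for url, response in zip(URLS, responses):
    print(url, response.status_code if response is not None else "request failed")

The aiohttp version: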

import asyncio
import aiohttp

async def async_aiohttp_get_all(urls, cookies):
    # Fetch all URLs concurrently within a single ClientSession
    async with aiohttp.ClientSession(cookies=cookies) as session:
        async def fetch(url):
            try:
                async with session.get(url) as response:
                    return response.status
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # Treat network-level failures as 404, as in solution 1
                return 404
        return await asyncio.gather(*[
            fetch(url) for url in urls
        ])

results = asyncio.run(async_aiohttp_get_all(URLS, None))
for i, url in enumerate(URLS):
    print(url, results[i])
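
One practical note for 2000 URLs: asyncio.gather() starts every coroutine at once. aiohttp's default connector already caps concurrent connections at 100, but you can bound concurrency explicitly with a semaphore - a sketch with an assumed limit of 100:

import asyncio
import aiohttp

async def fetch_all_bounded(urls, limit=100):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` requests in flight
    async with aiohttp.ClientSession() as session:
        async def fetch(url):
            async with semaphore:
                try:
                    async with session.get(url) as response:
                        return response.status
                except (aiohttp.ClientError, asyncio.TimeoutError):
                    return 404
        return await asyncio.gather(*(fetch(url) for url in urls))

results = asyncio.run(fetch_all_bounded(URLS))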