What is the fastest way to send 100,000 HTTP requests in Python?

Question

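I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python implements threading/concurrency. I have even looked at the python concurrence library, but cannot figure out how to write this program correctly. Has anyone come across a similar problem? I guess generally I need to know how to perform thousands of tasks in Python as fast as possible; I suppose that means "concurrently".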

Answer 1

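If you want the best possible performance, you might want to consider using asynchronous I/O rather than threads. The overhead associated with thousands of OS threads is non-trivial, and context switching within the Python interpreter adds even more on top of it. Threading will certainly get the job done, but I suspect that an asynchronous route will give better overall performance.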

Answer 2

A solution using Twisted's thread pool (Python 2):

from twisted.internet import reactor, threads
from urlparse import urlparse
import httplib
import itertools


concurrent = 200
finished=itertools.count(1)
reactor.suggestThreadPoolSize(concurrent)

def getStatus(ourl):
    url = urlparse(ourl)
    conn = httplib.HTTPConnection(url.netloc)   
    conn.request("HEAD", url.path)
    res = conn.getresponse()
    return res.status

def processResponse(response,url):
    print response, url
    processedOne()

def processError(error,url):
    print "error", url#, error
    processedOne()

def processedOne():
    if finished.next()==added:
        reactor.stop()

def addTask(url):
    req = threads.deferToThread(getStatus, url)
    req.addCallback(processResponse, url)
    req.addErrback(processError, url)   

added=0
for url in open('urllist.txt'):
    added+=1
    addTask(url.strip())

try:
    reactor.run()
except KeyboardInterrupt:
    reactor.stop()

Answer 3

I know this is an old question, but in Python 3.7 you can do this using asyncio and aiohttp.

import asyncio
import aiohttp
from aiohttp import ClientSession, ClientConnectorError

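async def fetch_html(url: str, session: ClientSession, **kwargs) -> tuple:
    try:
        # "async with" releases the connection back to the pool when done
        async with session.request(method="GET", url=url, **kwargs) as resp:
            return (url, resp.status)
    except ClientConnectorError:
        # Connection failures are reported as 404 here, not a real server response
        return (url, 404)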

async def make_requests(urls: set, **kwargs) -> None:
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(
                fetch_html(url=url, session=session, **kwargs)
            )
        results = await asyncio.gather(*tasks)

    for result in results:
        print(f'{result[1]} - {str(result[0])}')

if __name__ == "__main__":
    import pathlib
    import sys

    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    here = pathlib.Path(__file__).parent

    with open(here.joinpath("urls.txt")) as infile:
        urls = set(map(str.strip, infile))

    asyncio.run(make_requests(urls=urls))
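
One thing the script above does not do is cap how many requests are in flight: with 100,000 URLs, every task starts at once, which may run into socket or file-descriptor limits. A minimal sketch of one way to bound concurrency with `asyncio.Semaphore` follows. The names `gather_bounded` and `fake_fetch` are hypothetical and not part of the answer; `fake_fetch` only sleeps and stands in for a real coroutine such as `fetch_html`.

```python
import asyncio


async def gather_bounded(limit, coro_fn, items):
    """Run coro_fn(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await coro_fn(item)

    # gather() returns the results in the same order as `items`.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_fetch(url):
    # Hypothetical stand-in for a real request; performs no network I/O.
    await asyncio.sleep(0.01)
    return (url, 200)


if __name__ == "__main__":
    urls = [f"http://example.com/{i}" for i in range(20)]
    results = asyncio.run(gather_bounded(5, fake_fetch, urls))
    print(len(results), "results")
```

With aiohttp, a similar effect can probably be had by passing a `TCPConnector` with a `limit` to `ClientSession`, but the semaphore version works with any coroutine.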

Answer 4

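Using a thread pool is a good option, and it makes this fairly easy. Unfortunately, Python doesn't have a standard library that makes thread pools really simple. But here is a decent library that should get you started: http://www.chrisarndt.de/projects/threadpool/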
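
Since Python 3.2 the standard library does ship a thread pool, `concurrent.futures.ThreadPoolExecutor`, so on a modern interpreter the same idea can be sketched without a third-party package. This is only an illustration under assumptions: `head_status` and `check_all` are hypothetical names, the pool size of 100 is a guess to tune, and errors are reduced to a text message.

```python
import http.client
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def head_status(url, timeout=10):
    """Send one HEAD request; return (url, status) or (url, error text)."""
    parts = urlparse(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    try:
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            conn.request("HEAD", parts.path or "/")
            return url, conn.getresponse().status
        finally:
            conn.close()
    except Exception as exc:  # bad URL, DNS failure, refused connection, timeout...
        return url, f"error: {exc}"


def check_all(urls, workers=100, fetch=head_status):
    """Apply `fetch` to every URL using a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, not completion order.
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    for url, status in check_all(["http://example.com/"], workers=4):
        print(status, url)
```

The `fetch` parameter is there so the pool logic can be exercised with a stub instead of real network calls.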

Answer 5

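Create an epoll object, open many client TCP sockets, adjust their send buffers to be a bit larger than the request header, send a request header (this should be immediate, since it only places the data into the buffer), register the socket in the epoll object, call .poll on the epoll object, read the first 3 bytes from each socket returned by .poll, write them to sys.stdout followed by \n (don't flush), and close the client socket.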
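
The steps above can be sketched roughly as follows. Several things here are assumptions rather than part of the answer: the sketch uses the standard `selectors` module (which is backed by epoll on Linux) instead of calling `select.epoll` directly, it waits for the socket to become writable before sending, it skips the send-buffer tuning, and it reads 12 bytes rather than 3 because the status code sits at offsets 9 to 11 of a line such as `HTTP/1.0 200`. The function name `head_statuses` is hypothetical, and hosts are assumed to be IPv4 addresses so that no blocking DNS lookup happens.

```python
import os
import selectors
import socket


def head_statuses(targets, timeout=10.0):
    """Return {(host, port, path): status code text or error text}."""
    sel = selectors.DefaultSelector()  # epoll-backed on Linux
    results = {}

    def finish(sock, target, outcome):
        results[target] = outcome
        sel.unregister(sock)
        sock.close()

    for target in targets:
        host, port, path = target
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setblocking(False)
        sock.connect_ex((host, port))  # starts the connect without waiting
        request = f"HEAD {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
        # Writable means the connect has finished (or failed).
        sel.register(sock, selectors.EVENT_WRITE, (target, request))

    while sel.get_map():
        events = sel.select(timeout)
        if not events:  # nothing became ready within `timeout` seconds
            for key in list(sel.get_map().values()):
                finish(key.fileobj, key.data[0], "timeout")
            break
        for key, mask in events:
            sock, (target, request) = key.fileobj, key.data
            try:
                if mask & selectors.EVENT_WRITE:
                    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
                    if err:
                        raise OSError(err, os.strerror(err))
                    sock.send(request)  # small enough to fit the send buffer
                    sel.modify(sock, selectors.EVENT_READ, key.data)
                else:
                    head = sock.recv(12)  # e.g. b"HTTP/1.0 200"
                    finish(sock, target, head[9:12].decode("ascii", "replace"))
            except OSError as exc:
                finish(sock, target, f"error: {exc}")
    sel.close()
    return results
```

For 100,000 URLs the targets would have to be fed in batches, since opening every socket at once is likely to exceed the per-process file descriptor limit.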

Answer 6

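For your case, threading will probably do the trick, since you will most likely spend most of the time waiting for a response. There are helpful modules in the standard library, such as Queue, that might help.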
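
A minimal sketch of the queue-plus-threads pattern, written for Python 3 (where the `Queue` module is named `queue`). The helper name `run_workers`, the default of 100 workers, and the `handle` callback are assumptions for illustration; in practice `handle` would be whatever function sends one request and returns its status code.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker there is no more work


def run_workers(items, handle, num_workers=100):
    """Apply handle(item) to every item using a fixed set of worker threads.

    Returns a list of (item, outcome) pairs in completion order; if handle
    raised, the outcome is the exception object.
    """
    tasks = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            item = tasks.get()
            if item is _STOP:
                break
            try:
                outcome = handle(item)
            except Exception as exc:
                outcome = exc
            results.put((item, outcome))

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(_STOP)  # one sentinel per worker
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():  # safe: all producers have exited
        collected.append(results.get())
    return collected


if __name__ == "__main__":
    # Stub handler in place of a real HTTP call.
    print(sorted(run_workers(range(5), lambda n: n * n, num_workers=2)))
```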

Answer 7

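Consider using Windmill, although Windmill probably can't handle that many threads.

You could do it with a hand-rolled Python script on 5 machines, each one connecting outbound using ports 40000-60000, opening 100,000 port connections in total.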

Answer 8

This Twisted async web client runs pretty fast.

#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
codes = {}

def getLock(url, simultaneous = 1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host
    # Tweak this number, but it should be no larger than pool.maxPersistentPerHost 
    lock = getLock(url,4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        codes[url] = resp.code
    except Exception as e:
        codes[url] = str(e)
    finally:
        lock.release()


dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(codes)

Answer 9

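The easiest way would be to use Python's built-in threading library. ~~They're not "real" / kernel threads~~

They have issues (like serialization), but they are good enough. You'd want a queue and a thread pool. One option is here, but it's trivial to write your own. You can't parallelize all 100,000 calls, but you can fire off 100 (or so) of them at the same time.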
