I have a few scripts that grab webpages and parse the information from them. (An example can be seen at http://bluedevilbooks.com/search/?DEPT=MATH&CLASS=103&SEC=01 .)

I ran cProfile on it, and as I assumed, urlopen takes up a lot of the time. Is there a way to fetch the pages faster? Or a way to fetch several pages at once? I'll do whatever is simplest, as I'm new to Python and web development.

Thanks in advance! :)
Update: I have a function called fetchURLs() that I use to make an array of the URLs I need, so something like urls = fetchURLS(). The URLs all come from the Amazon and eBay APIs (which confuses me as to why it takes so long to load an array of URLs of XML files; maybe my web host is slow?).

What I need to do is load each URL, read each page, and send that data to another part of the script that will parse and display the data.

Note that I can't do the latter part until ALL of the pages have been fetched; that's what my issue is.

Also, my host limits me to 25 processes at a time, I believe, so whatever is easiest on the server would be nice :)
Here are the times:
Sun Aug 15 20:51:22 2010 prof
211352 function calls (209292 primitive calls) in 22.254 CPU seconds
Ordered by: internal time
List reduced from 404 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
10 18.056 1.806 18.056 1.806 {_socket.getaddrinfo}
4991 2.730 0.001 2.730 0.001 {method 'recv' of '_socket.socket' objects}
10 0.490 0.049 0.490 0.049 {method 'connect' of '_socket.socket' objects}
2415 0.079 0.000 0.079 0.000 {method 'translate' of 'unicode' objects}
12 0.061 0.005 0.745 0.062 /usr/local/lib/python2.6/HTMLParser.py:132(goahead)
3428 0.060 0.000 0.202 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1306(endData)
1698 0.055 0.000 0.068 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1351(_smartPop)
4125 0.053 0.000 0.056 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:118(setup)
1698 0.042 0.000 0.358 0.000 /usr/local/lib/python2.6/HTMLParser.py:224(parse_starttag)
1698 0.042 0.000 0.275 0.000 /usr/local/lib/python2.6/site-packages/BeautifulSoup.py:1397(unknown_starttag)
EDIT: I'm expanding the answer to include a more polished example. I have found a lot of hostility and misinformation in this post regarding threading v.s. async I/O. Therefore I'm also adding more arguments to refute certain invalid claims. I hope this will help people choose the right tool for the right job.

This is a dup to a question posted 3 days ago: Python urllib2.urlopen() is slow, need a better way to read several urls - Stack Overflow.

I'm polishing the code to show how to fetch multiple webpages in parallel using threads.
import time
import threading
import Queue
import urllib2

# utility - spawn a thread to execute target for each args
def run_parallel_in_threads(target, args_list):
    result = Queue.Queue()
    # wrapper to collect return value in a Queue
    def task_wrapper(*args):
        result.put(target(*args))
    threads = [threading.Thread(target=task_wrapper, args=args) for args in args_list]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def dummy_task(n):
    for i in xrange(n):
        time.sleep(0.1)
    return n

# below is the application code
urls = [
    ('http://www.google.com/',),
    ('http://www.lycos.com/',),
    ('http://www.bing.com/',),
    ('http://www.altavista.com/',),
    ('http://achewood.com/',),
]

def fetch(url):
    return urllib2.urlopen(url).read()

run_parallel_in_threads(fetch, urls)
As you can see, the application-specific code has only 3 lines, which can be collapsed into 1 line if you are aggressive. I don't think anyone can justify the claim that this is complex and unmaintainable.

Unfortunately most of the other threading code posted here has some flaws. Many of them do active polling to wait for the code to finish. join() is a better way to synchronize the code. I think this code has improved upon all the threading examples so far.

keep-alive connection
WoLpH's suggestion about using a keep-alive connection could be very useful if all your URLs are pointing to the same server.

twisted
Aaron Gallagher is a fan of the twisted framework and he is hostile to anyone who suggests threads. Unfortunately a lot of his claims are misinformation. For example, he said "-1 for suggesting threads. This is IO-bound; threads are useless here." This is contrary to evidence, as both Nick T and I have demonstrated speed gains from using threads. In fact, I/O-bound applications have the most to gain from using Python's threads (v.s. no gain in CPU-bound applications). Aaron's misguided criticism of threads shows he is rather confused about parallel programming in general.
Right tool for the right job

I'm well aware of the issues pertaining to parallel programming using threads, Python, async I/O and so on. Each tool has its pros and cons. For each situation there is an appropriate tool. I'm not against twisted (though I have not deployed one myself). But I don't believe we can flat out say that threads are BAD and twisted is GOOD in all situations.

For example, if the OP's requirement is to fetch 10,000 websites in parallel, async I/O would be preferable. Threading won't be appropriate (unless maybe with stackless Python).

Aaron's opposition to threads is mostly generalizations. He fails to recognize that this is a trivial parallelization task. Each task is independent and does not share resources. So most of his attacks do not apply.

Given my code has no external dependency, I'll call it the right tool for the right job.
Performance

I think most people would agree that the performance of this task largely depends on the networking code and the external server, and that the performance of the platform code should have a negligible effect. However, Aaron's benchmark shows a 50% speed gain over the threaded code. I think it is necessary to respond to this apparent speed gain.

In Nick's code, there is an obvious flaw that causes inefficiency. But how do you explain the 233ms speed gain over my code? I think even twisted fans will refrain from jumping to conclusions by attributing this to the efficiency of twisted. There is, after all, a huge amount of variables outside of the system code, like the remote server's performance, the network, caching, and the difference in implementation between urllib2 and the twisted web client, and so on.

Just to make sure Python's threading does not incur a huge amount of inefficiency, I did a quick benchmark spawning 5 threads and then 500 threads. I am quite comfortable saying that the overhead of spawning 5 threads is negligible and cannot explain the 233ms speed difference.
In [274]: %time run_parallel_in_threads(dummy_task, [(0,)]*5)
CPU times: user 0.00 s, sys: 0.00 s, total: 0.00 s
Wall time: 0.00 s
Out[275]: <Queue.Queue instance at 0x038B2878>
In [276]: %time run_parallel_in_threads(dummy_task, [(0,)]*500)
CPU times: user 0.16 s, sys: 0.00 s, total: 0.16 s
Wall time: 0.16 s
In [278]: %time run_parallel_in_threads(dummy_task, [(10,)]*500)
CPU times: user 1.13 s, sys: 0.00 s, total: 1.13 s
Wall time: 1.13 s <<<<<<<< This means 0.13s of overhead
Further testing of my parallel fetching shows a huge variability in the response times over 17 runs. (Unfortunately I don't have twisted set up to verify Aaron's code.)
0.75 s
0.38 s
0.59 s
0.38 s
0.62 s
1.50 s
0.49 s
0.36 s
0.95 s
0.43 s
0.61 s
0.81 s
0.46 s
1.21 s
2.87 s
1.04 s
1.72 s
My testing does not support Aaron's conclusion that threading is consistently slower than async I/O by a measurable margin. Given the number of variables involved, I have to say this is not a valid test to measure the systematic performance difference between async I/O and threading.
Here's a standard library solution. It's not quite as fast, but it uses less memory than the threaded solutions.
try:
    from http.client import HTTPConnection, HTTPSConnection
except ImportError:
    from httplib import HTTPConnection, HTTPSConnection

connections = []
results = []

for url in urls:
    scheme, _, host, path = url.split('/', 3)
    h = (HTTPConnection if scheme == 'http:' else HTTPSConnection)(host)
    h.request('GET', '/' + path)
    connections.append(h)

for h in connections:
    results.append(h.getresponse().read())
Also, if most of your requests are to the same host, then reusing the same http connection would probably help more than doing things in parallel.
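To make the per-host reuse idea concrete, here is a small sketch (the group_by_host helper is hypothetical, not part of the answer above) that groups a URL list by (scheme, host); each group could then be fed through a single kept-alive HTTPConnection instead of opening one connection per URL:

```python
from collections import defaultdict
try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

def group_by_host(urls):
    """Group request paths by (scheme, host) so that each group can
    share one persistent connection."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.scheme, parts.netloc)].append(parts.path or '/')
    return dict(groups)

urls = [
    'http://example.com/a',
    'http://example.com/b',
    'https://other.example.org/c',
]
print(group_by_host(urls))
```

Each dictionary entry maps one host to all the paths you want from it, which is exactly the shape the HTTPConnection loop above needs if you open one connection per key instead of one per URL.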
Please find a Python network benchmark script for identifying single-connection slowness:
"""Python network test."""
from socket import create_connection
from time import time
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
TIC = time()
create_connection(('216.58.194.174', 80))
print('Duration socket IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
create_connection(('google.com', 80))
print('Duration socket DNS connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://216.58.194.174')
print('Duration urlopen IP connection (s): {:.2f}'.format(time() - TIC))
TIC = time()
urlopen('http://google.com')
print('Duration urlopen DNS connection (s): {:.2f}'.format(time() - TIC))
And example results with Python 3.6:
Duration socket IP connection (s): 0.02
Duration socket DNS connection (s): 75.51
Duration urlopen IP connection (s): 75.88
Duration urlopen DNS connection (s): 151.42
Python 2.7.13 has very similar results.

In this case, DNS and urlopen slowness are easily identified.
Use twisted! It makes this kind of thing absurdly easy compared to, say, using threads.
from twisted.internet import defer, reactor
from twisted.web.client import getPage
import time

def processPage(page, url):
    # do something here.
    return url, len(page)

def printResults(result):
    for success, value in result:
        if success:
            print 'Success:', value
        else:
            print 'Failure:', value.getErrorMessage()

def printDelta(_, start):
    delta = time.time() - start
    print 'ran in %0.3fs' % (delta,)
    return delta

urls = [
    'http://www.google.com/',
    'http://www.lycos.com/',
    'http://www.bing.com/',
    'http://www.altavista.com/',
    'http://achewood.com/',
]

def fetchURLs():
    callbacks = []
    for url in urls:
        d = getPage(url)
        d.addCallback(processPage, url)
        callbacks.append(d)

    callbacks = defer.DeferredList(callbacks)
    callbacks.addCallback(printResults)
    return callbacks

@defer.inlineCallbacks
def main():
    times = []
    for x in xrange(5):
        d = fetchURLs()
        d.addCallback(printDelta, time.time())
        times.append((yield d))
    print 'avg time: %0.3fs' % (sum(times) / len(times),)

reactor.callWhenRunning(main)
reactor.run()
This code also performs better than any of the other solutions posted (edited after I shut off some things that were using a lot of my bandwidth):
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 29996)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.518s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.461s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30033)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.435s
Success: ('http://www.google.com/', 8117)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.449s
Success: ('http://www.google.com/', 8135)
Success: ('http://www.lycos.com/', 30349)
Success: ('http://www.bing.com/', 28611)
Success: ('http://www.altavista.com/', 8378)
Success: ('http://achewood.com/', 15043)
ran in 0.547s
avg time: 0.482s
And using Nick T's code, rigged up to also give the average of five runs and show the output better:
Starting threaded reads:
...took 1.921520 seconds ([8117, 30070, 15043, 8386, 28611])
Starting threaded reads:
...took 1.779461 seconds ([8135, 15043, 8386, 30349, 28611])
Starting threaded reads:
...took 1.756968 seconds ([8135, 8386, 15043, 30349, 28611])
Starting threaded reads:
...took 1.762956 seconds ([8386, 8135, 15043, 29996, 28611])
Starting threaded reads:
...took 1.654377 seconds ([8117, 30349, 15043, 8386, 28611])
avg time: 1.775s
Starting sequential reads:
...took 1.389803 seconds ([8135, 30147, 28611, 8386, 15043])
Starting sequential reads:
...took 1.457451 seconds ([8135, 30051, 28611, 8386, 15043])
Starting sequential reads:
...took 1.432214 seconds ([8135, 29996, 28611, 8386, 15043])
Starting sequential reads:
...took 1.447866 seconds ([8117, 30028, 28611, 8386, 15043])
Starting sequential reads:
...took 1.468946 seconds ([8153, 30051, 28611, 8386, 15043])
avg time: 1.439s
And using Wai Yip Tung's code:
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30051 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.704s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.845s
Fetched 8153 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30070 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.689s
Fetched 8117 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30114 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.647s
Fetched 8135 from http://www.google.com/
Fetched 28611 from http://www.bing.com/
Fetched 8386 from http://www.altavista.com/
Fetched 30349 from http://www.lycos.com/
Fetched 15043 from http://achewood.com/
done in 0.693s
avg time: 0.715s
I've got to say, it does surprise me that the sequential fetches performed as well as they did for me.
Here is an example using python Threads. The other threaded examples here launch a thread per URL, which is not very friendly behaviour if it causes too many hits for the server to handle (for example, it is common for spiders to have many URLs on the same host).
from threading import Thread
from urllib2 import urlopen
from time import time, sleep

WORKERS=1
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = []

class Worker(Thread):
    def run(self):
        while urls:
            url = urls.pop()
            results.append((url, urlopen(url).read()))

start = time()
threads = [Worker() for i in range(WORKERS)]
any(t.start() for t in threads)
while len(results)<40:
    sleep(0.1)
print time()-start
Note: the times given here are for 40 URLs and will depend a lot on the speed of your internet connection and the latency to the server. Being in Australia, my ping is > 300ms.
With WORKERS=1 it took 86 seconds to run.
With WORKERS=4 it took 23 seconds to run.
With WORKERS=10 it took 10 seconds to run.

So having 10 threads downloading is 8.6 times as fast as a single thread.
Here is an upgraded version that uses a Queue. There are at least a couple of advantages.
1. The URLs are requested in the order that they appear in the list
2. Can use q.join() to detect when the requests have all completed
3. The results are kept in the same order as the URL list
from threading import Thread
from urllib2 import urlopen
from time import time, sleep
from Queue import Queue

WORKERS=10
urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']*10
results = [None]*len(urls)

def worker():
    while True:
        i, url = q.get()
        # print "requesting ", i, url     # if you want to see what's going on
        results[i]=urlopen(url).read()
        q.task_done()

start = time()
q = Queue()
for i in range(WORKERS):
    t=Thread(target=worker)
    t.daemon = True
    t.start()

for i,url in enumerate(urls):
    q.put((i,url))

q.join()
print time()-start
The actual wait is probably not in urllib2 but in the server and/or your network connection to the server.

There are 2 ways of speeding this up.

The multiprocessing lib makes things really easy. And there is now an excellent Python lib that does this for you, called requests. Use the standard API of requests if you want a thread-based solution, or its async API (using gevent under the hood) if you want a solution based on non-blocking IO.
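As a sketch of the multiprocessing route: multiprocessing.dummy provides a thread-backed Pool with the same API as multiprocessing, and its map call directly handles the OP's requirement of waiting until all pages are fetched. The fetch function below is a stub standing in for a real urllib2.urlopen(url).read(), so the example runs offline; swap in the real call in practice:

```python
from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

def fetch(url):
    # Stand-in for urllib2.urlopen(url).read(); a stub keeps the sketch runnable offline.
    return (url, len(url))

urls = ['http://example.com/%d' % i for i in range(5)]
pool = Pool(4)                   # 4 worker threads
results = pool.map(fetch, urls)  # blocks until every URL is done; preserves input order
pool.close()
pool.join()
print(results)
```

Because map both blocks and preserves order, the parsing stage can start right after it returns, with results[i] corresponding to urls[i].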
Since this question was posted, it looks like there's a higher-level abstraction available, ThreadPoolExecutor:

https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example

The example from there, pasted here for convenience:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

# Retrieve a single page and report the url and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
There's also map, which I think makes the code easier: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.Executor.map
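For instance, the submit/as_completed example above could be condensed with Executor.map along these lines. Here load_url is stubbed to return only the URL length so the sketch runs without network access; a real version would call urllib.request.urlopen as in the example above:

```python
import concurrent.futures

def load_url(url):
    # Stand-in for urllib.request.urlopen(url).read(); a stub keeps the sketch offline.
    return len(url)

URLS = ['http://a.example/', 'http://bb.example/', 'http://ccc.example/']

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Unlike as_completed, map yields results in the order of the input URLs.
    sizes = list(executor.map(load_url, URLS))
print(sizes)
```

The trade-off versus as_completed is that map delivers results in input order rather than completion order, and a single raised exception propagates when its result is consumed.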
Ray offers an elegant way to do this (in both Python 2 and Python 3). Ray is a library for writing parallel and distributed Python.

Simply define the fetch function with the @ray.remote decorator. Then you can fetch a URL in the background by calling fetch.remote(url).
import ray
import sys

ray.init()

@ray.remote
def fetch(url):
    if sys.version_info >= (3, 0):
        import urllib.request
        return urllib.request.urlopen(url).read()
    else:
        import urllib2
        return urllib2.urlopen(url).read()

urls = ['https://en.wikipedia.org/wiki/Donald_Trump',
        'https://en.wikipedia.org/wiki/Barack_Obama',
        'https://en.wikipedia.org/wiki/George_W._Bush',
        'https://en.wikipedia.org/wiki/Bill_Clinton',
        'https://en.wikipedia.org/wiki/George_H._W._Bush']

# Fetch the webpages in parallel.
results = ray.get([fetch.remote(url) for url in urls])
If you also want to process the webpages in parallel, you can either put the processing code directly into fetch, or you can define a new remote function and compose them together.
@ray.remote
def process(html):
    tokens = html.split()
    return set(tokens)

# Fetch and process the pages in parallel.
results = []
for url in urls:
    results.append(process.remote(fetch.remote(url)))
results = ray.get(results)
If you have a very long list of URLs that you want to fetch, you may wish to issue some tasks and then process them in the order that they complete. You can do this using ray.wait.
urls = 100 * urls  # Pretend we have a long list of URLs.
results = []

in_progress_ids = []

# Start pulling 10 URLs in parallel.
for _ in range(10):
    url = urls.pop()
    in_progress_ids.append(fetch.remote(url))

# Whenever one finishes, start fetching a new one.
while len(in_progress_ids) > 0:
    # Get a result that has finished.
    [ready_id], in_progress_ids = ray.wait(in_progress_ids)
    results.append(ray.get(ready_id))
    # Start a new task.
    if len(urls) > 0:
        in_progress_ids.append(fetch.remote(urls.pop()))
Fetching webpages obviously will take a while as you're not accessing anything local. If you have several to access, you can use the threading module to run a couple at once.

Here's a very crude example
import threading
import urllib2
import time

urls = ['http://docs.python.org/library/threading.html',
        'http://docs.python.org/library/thread.html',
        'http://docs.python.org/library/multiprocessing.html',
        'http://docs.python.org/howto/urllib2.html']
data1 = []
data2 = []

class PageFetch(threading.Thread):
    def __init__(self, url, datadump):
        self.url = url
        self.datadump = datadump
        threading.Thread.__init__(self)
    def run(self):
        page = urllib2.urlopen(self.url)
        self.datadump.append(page.read()) # don't do it like this.

print "Starting threaded reads:"
start = time.clock()
for url in urls:
    PageFetch(url, data2).start()
while len(data2) < len(urls): pass # don't do this either.
print "...took %f seconds" % (time.clock() - start)

print "Starting sequential reads:"
start = time.clock()
for url in urls:
    page = urllib2.urlopen(url)
    data1.append(page.read())
print "...took %f seconds" % (time.clock() - start)

for i,x in enumerate(data1):
    print len(data1[i]), len(data2[i])
This was the output when I ran it:
Starting threaded reads:
...took 2.035579 seconds
Starting sequential reads:
...took 4.307102 seconds
73127 19923
19923 59366
361483 73127
59366 361483
Grabbing the data from the threads by appending to a list is probably ill-advised (a Queue would be better), but it illustrates that there is a difference.
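For comparison, here is a minimal sketch of that Queue variant: each thread puts its result on a thread-safe Queue instead of appending to a shared list, and the main thread drains it after join(). The fetch is stubbed with len(url) so the sketch runs offline; a real version would put urllib2.urlopen(url).read() on the queue instead:

```python
import threading
try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def fetch_into(url, out):
    # Stand-in for urllib2.urlopen(url).read(); a stub keeps the sketch offline.
    out.put((url, len(url)))

out = queue.Queue()          # explicit, thread-safe hand-off between threads
urls = ['http://example.com/%d' % i for i in range(4)]
threads = [threading.Thread(target=fetch_into, args=(u, out)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()                 # no busy-wait polling on len(results)

results = dict(out.get() for _ in urls)
print(results)
```

Using join() here also replaces the `while len(data2) < len(urls): pass` busy-wait from the example above.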