I have a text file with more than 100,000 URLs. I want to fetch all of them and check each response for a certain piece of text. My current code does this, but it takes several hours to finish.
Here is my current code:
text = "stackoverflow"
urls = open("urls.txt").read().splitlines()
def fetch_url(url):
try:
response = urlopen(url, timeout=2)
return url, response.read(), None
except Exception as e:
return url, None, e
try:
results = ThreadPool(300).imap_unordered(fetch_url, urls)
except:
pass
for url, html, error in results:
if error is None:
if text.encode() in html:
print("Found in " + url)
else:
print("error %r: %s" % (url, error))
I don't think you can make this much faster, since it depends on your internet speed. However, the results accumulate quickly, which is not great memory management, so I think you should do it like this instead:
urls = open("urls.txt").read().splitlines()
def fetch_url(url):
try:
response = urlopen(url, timeout=2)
return url, response.read(), None
except Exception as e:
return url, None, e
def check_url(url):
text = "stackoverflow"
url, html, error = fetch_url(url)
if error is None:
if text.encode() in html:
print("Found in " + url)
else:
print("error %r: %s" % (url, error))
try:
ThreadPool(300).imap_unordered(check_url, urls)
except:
pass
This way, you don't end up with a huge list of results.
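If you are on Python 3, here is a minimal sketch of the same idea using the standard library's concurrent.futures instead of multiprocessing.pool (the helper name contains_text is hypothetical, and the 300-worker count is just carried over from the code above):

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

text = "stackoverflow"
urls = open("urls.txt").read().splitlines()

def contains_text(url):  # hypothetical helper, not from the original code
    try:
        html = urlopen(url, timeout=2).read()
        return url, text.encode() in html, None
    except Exception as e:
        return url, False, e

with ThreadPoolExecutor(max_workers=300) as executor:
    futures = [executor.submit(contains_text, url) for url in urls]
    for future in as_completed(futures):
        url, found, error = future.result()
        if error is not None:
            print("error %r: %s" % (url, error))
        elif found:
            print("Found in " + url)

Each worker returns only the URL and a boolean, so the fetched HTML is discarded as soon as it has been checked; only the small Future objects stay alive while the run is in progress.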