我正在尝试通过多处理功能对某些网址进行重复测试,但会出错。我猜是因为当我调用函数时,它从url的起始字母开始将url读取为h。我正在寻求您的帮助。
这是我的代码:
from urllib2 import urlopen
import hashlib
from multiprocessing import Pool
def find_matches(urls):
d = {}
url_contents = {}
matches = []
for url in urls:
c = urlopen(url)
url_contents[url] = []
while 1:
r = c.read(4096)
if not r: break
md5 = hashlib.md5(r).hexdigest()
url_contents[url].append(md5)
if md5 in d:
url2 = d[md5]
matches.append((md5, url, url2))
else:
d[md5] = []
d[md5].append(url)
print ("This urls has duplicates: ", matches)
p = Pool(4)
print(p.map(find_matches, [ "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]))
我的错误:
Traceback (most recent call last):
File "hata.py", line 28, in <module>
print(p.map(find_matches, [ "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]))
File "/usr/lib/python2.7/multiprocessing/pool.py", line 253, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
raise self._value
ValueError: unknown url type: h
您的代码有问题。首先,您的函数find_matches
根据您的迭代方式需要一个URL列表。它使整个事情变得不平行。
修复很简单,您需要传递函数find_matches
的列表
print(p.map(find_matches, [["http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]]))
恕我直言,这不是您应该使用多重处理的情况,因为您的函数需要相同的URL集。
但是,如果您的程序同时需要多个设置,那么这很有意义。
print(p.map(find_matches, [[url1, url2], [url3, url4]]))
这种方式将对2组网址进行并行处理。