Getting ValueError: unknown url type: h


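I'm trying to test some URLs for duplicate content using multiprocessing, but I get an error. My guess is that when I call the function, it reads each URL character by character, starting from its first letter, h. I'd appreciate your help.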

Here is my code:

from urllib2 import urlopen
import hashlib
from multiprocessing import Pool

def find_matches(urls):

    d = {}
    url_contents = {}
    matches = []
    for url in urls:
        c = urlopen(url)
        url_contents[url] = []
        while 1:
            r = c.read(4096)
            if not r: break
            md5 = hashlib.md5(r).hexdigest()
            url_contents[url].append(md5)
            if md5 in d:
                url2 = d[md5]
                matches.append((md5, url, url2))
            else:
                d[md5] = []
            d[md5].append(url)
    print ("This urls has duplicates: ", matches)


p = Pool(4)
print(p.map(find_matches, [ "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]))

My error:

Traceback (most recent call last):
  File "hata.py", line 28, in <module>
    print(p.map(find_matches, [ "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]))
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 253, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 572, in get
    raise self._value
ValueError: unknown url type: h

python multiprocessing urllib2
1 Answer

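Your code has a couple of problems. First, find_matches expects a list of URLs, because it iterates over its argument. p.map, however, calls the function once per element of the iterable, so each call receives a single URL string. The for loop then walks over that string's characters, and urlopen("h") raises ValueError: unknown url type: h. Second, because the function needs the whole list at once, the work is not actually parallel.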
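To see where the "h" comes from without touching the network, here is a minimal sketch. It uses the built-in map, which calls the function once per element just as p.map does. The function first_item and the example.com URLs are made up for illustration and are not part of the original code.

```python
def first_item(urls):
    # Same `for url in urls` loop as find_matches, but it only reports
    # what the first iteration would have handed to urlopen().
    for url in urls:
        return url


# Hypothetical URLs, used only for this demonstration.
urls = ["http://example.com/a.png", "https://example.com/b.png"]

# One call per element: each call gets a single string, and looping over
# a string yields its characters, so the first "url" is just "h".
per_element = list(map(first_item, urls))
print(per_element)   # ['h', 'h']

# Wrapping the list in another list makes the whole URL list one element,
# so the loop inside the function sees complete URLs again.
per_list = list(map(first_item, [urls]))
print(per_list)      # ['http://example.com/a.png']
```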

The fix is simple: pass find_matches a list of lists, so that the whole URL list arrives as a single argument:

print(p.map(find_matches, [["http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/tr/9/98/Mu%C4%9Fla_S%C4%B1tk%C4%B1_Ko%C3%A7man_%C3%9Cniversitesi_logo.png", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg", "http://wiki.netseclab.mu.edu.tr/images/thumb/f/f7/MSKU-BlockchainResearchGroup.jpeg/300px-MSKU-BlockchainResearchGroup.jpeg", "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Hawai%27i.jpg/1024px-Hawai%27i.jpg "]]))

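IMHO, this is not a case for multiprocessing, because your function needs the full set of URLs in a single call, so only one worker does any work.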

If your program had several independent sets of URLs to check at the same time, though, it would make sense:

print(p.map(find_matches, [[url1, url2], [url3, url4]]))

This way, the two sets of URLs are processed in parallel.
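If the goal is to parallelize the downloads for a single flat list of URLs, one possible restructuring is to have each worker handle exactly one URL and do the comparison afterwards in the parent process. The sketch below is an assumption about what might suit the question, not the asker's method. The names url_digest, find_duplicates and check are hypothetical, and it hashes each whole response body instead of comparing per-4096-byte chunk hashes as the original loop does.

```python
import hashlib
from collections import defaultdict
from multiprocessing import Pool

try:
    from urllib2 import urlopen             # Python 2, as in the question
except ImportError:
    from urllib.request import urlopen      # Python 3 equivalent


def url_digest(url):
    """Worker: download one URL and return (url, md5 of the full body)."""
    md5 = hashlib.md5()
    # strip() guards against stray whitespace, such as the trailing space
    # in the last URL of the question's list.
    response = urlopen(url.strip())
    try:
        while True:
            chunk = response.read(4096)
            if not chunk:
                break
            md5.update(chunk)
    finally:
        response.close()
    return url, md5.hexdigest()


def find_duplicates(pairs):
    """Group URLs by digest and keep only groups with more than one URL."""
    by_digest = defaultdict(list)
    for url, digest in pairs:
        by_digest[digest].append(url)
    return [group for group in by_digest.values() if len(group) > 1]


def check(urls, workers=4):
    """Fetch the URLs in parallel, then compare digests in the parent."""
    pool = Pool(workers)
    try:
        # Here each element really is one URL, which is what map expects.
        return find_duplicates(pool.map(url_digest, urls))
    finally:
        pool.close()
        pool.join()
```

Calling check with the flat URL list from the question, under an if __name__ == "__main__": guard, should return the groups of URLs whose bodies are identical. Nothing here was run against the URLs in the question.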
