How to handle IncompleteRead in Python

Problem description · votes: 0 · answers: 9

I am trying to fetch some data from a website. However, it returns an incomplete read. The data I am trying to fetch is a huge set of nested links. I did some research online and found that this may be due to a server error (a chunked transfer encoding finishing before reaching the expected size). I also found a workaround for it at this link.

However, I am not sure how to use it for my case. The following is the code I have been writing:

import mechanize
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup  # or: from bs4 import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)

for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage

Please help me with this problem. Thanks.

Tags: python, python-2.7, web-scraping, beautifulsoup, mechanize
9 Answers
29 votes
The link you included in your question is simply a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement the entire patch, you can always add a try/except block wherever you read your links. For example:

import httplib
import urllib2

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    page = e.partial

For Python 3:

import http.client
from urllib import request

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
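Applied to the scraping code from the question, the same pattern can wrap each read. A minimal Python 3 sketch; the helper name read_lenient is illustrative, not part of any library:

import http.client
import urllib.request

def read_lenient(url):
    # Return the full body, or whatever partial data arrived before the error.
    try:
        return urllib.request.urlopen(url).read()
    except http.client.IncompleteRead as e:
        return e.partial

page = read_lenient('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands')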
    

10 votes
Note: this answer applies to Python 2 only (it was posted in 2013).

What I found in my case: sending HTTP/1.0 requests fixed the problem. Adding this:

import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
After I make the request:

req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):

httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunked responses, but for some reason the web server can't, so we make the request with HTTP/1.0.

On Python 3, it will tell you:

ModuleNotFoundError: No module named 'httplib'

Then use the http.client module instead, which solves the problem:

import http.client as http
http.HTTPConnection._http_vsn = 10
http.HTTPConnection._http_vsn_str = 'HTTP/1.0'
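Because this approach flips module-level state and has to remember to restore it, a context manager keeps the downgrade scoped. A minimal Python 3 sketch; force_http_1_0 is a hypothetical helper, not part of the standard library:

import http.client
import urllib.request
from contextlib import contextmanager

@contextmanager
def force_http_1_0():
    # Downgrade urllib's connections to HTTP/1.0, restoring the old version on exit.
    old_vsn = http.client.HTTPConnection._http_vsn
    old_str = http.client.HTTPConnection._http_vsn_str
    http.client.HTTPConnection._http_vsn = 10
    http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        yield
    finally:
        http.client.HTTPConnection._http_vsn = old_vsn
        http.client.HTTPConnection._http_vsn_str = old_str

with force_http_1_0():
    page = urllib.request.urlopen('http://example.com').read()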
    

8 votes
What worked for me was catching IncompleteRead as an exception and harvesting the data you managed to read in each iteration by putting it into a loop like the one below. (Note: I am using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)

import json
import http.client
import urllib.request

# Wrapped in a function so the original return statement is valid.
def rest_call(url, data):
    try:
        requestObj = urllib.request.urlopen(url, data)
        responseJSON = ""
        while True:
            try:
                responseJSONpart = requestObj.read()
            except http.client.IncompleteRead as icread:
                # Keep the partial data and try to read the rest.
                responseJSON = responseJSON + icread.partial.decode('utf-8')
                continue
            else:
                responseJSON = responseJSON + responseJSONpart.decode('utf-8')
                break
        return json.loads(responseJSON)
    except Exception as RESTex:
        print("Exception occurred making REST call: " + RESTex.__str__())
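The same accumulate-and-continue idea can be factored into a generic helper that returns raw bytes instead of JSON. A sketch; read_fully is an illustrative name, and it assumes the response object can keep being read after an IncompleteRead, as in the answer above:

import http.client
import urllib.request

def read_fully(response):
    # Collect everything the server sends, keeping partial data on IncompleteRead.
    buf = b''
    while True:
        try:
            chunk = response.read()
        except http.client.IncompleteRead as icread:
            buf += icread.partial
            continue
        return buf + chunk

response = urllib.request.urlopen('http://example.com')
body = read_fully(response)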
    

1 vote
You can use requests instead of urllib2. requests is based on urllib3, so it rarely has any problems. Put it in a loop to try it 3 times, and it will be much more robust. You can use it this way:

import inspect
import sys
import time
import requests

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg:
            break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.
                             format(inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
        time.sleep(10 * (i - 1))
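requests can also delegate the retrying of connection-level failures to urllib3's built-in Retry via a mounted HTTPAdapter. A sketch; the retry counts, backoff, and URL are illustrative, and note this retries failed requests, not reads that break off mid-body:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry up to 3 times on connection errors and common transient status codes.
retries = Retry(total=3, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

r = session.get('http://example.com', timeout=30)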
    

1 vote
Python 3, for reference only:

from urllib import request
import http.client

# Wrapped in a function so the original return statement is valid
# and file_path is defined.
def download(url, file_path):
    try:
        response = request.urlopen(url)
        file = response.read()
    except http.client.IncompleteRead as e:
        file = e.partial
    except Exception as result:
        print("Unknown error: " + str(result))
        return
    # save file
    with open(file_path, 'wb') as f:
        print("save -> %s " % file_path)
        f.write(file)

download('http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brand', 'page.html')
    

0 votes
I found that my virus checker/firewall was causing this problem: specifically the "Online Shield" part of AVG.


0 votes
This saved my life. It is all about byte ranges: the download continues from the last bad byte. It can obviously be improved further.

from http.client import IncompleteRead
from http.client import HTTPConnection
import urllib.request


def download_file(request, unsafe=False):
    bytes_ranges_supported = False
    http_1_0_tolerate = False
    return_unsafe = False
    return_raw = b''

    # Check if byte ranges are supported
    try:
        with urllib.request.urlopen(request) as response:
            if response.headers.get('Accept-Ranges') == 'bytes':
                bytes_ranges_supported = True
    except:
        pass

    # Check if HTTP/1.0 is supported
    try:
        with urllib.request.urlopen(request) as response:
            if response.status == 206:
                http_1_0_tolerate = True
    except:
        pass

    # Byte-range mode: resume from the last byte received
    if bytes_ranges_supported:
        while True:
            try:
                request.add_header('Range', 'bytes=%d-' % len(return_raw))
                with urllib.request.urlopen(request) as response:
                    return_raw += response.read()
                return return_raw
            except IncompleteRead as e:
                return_raw += e.partial
    # Alternative mode
    elif http_1_0_tolerate:
        HTTPConnection._http_vsn = 10
        HTTPConnection._http_vsn_str = 'HTTP/1.0'
    # Unsafe mode
    elif unsafe:
        return_unsafe = True

    # Legacy mode
    if return_raw == b'':
        with urllib.request.urlopen(request) as response:
            try:
                return_raw = response.read()
            except IncompleteRead as e:
                if return_unsafe:
                    return_raw = e.partial
                else:
                    raise e
    return return_raw


url = 'https://google.com'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'})
with open('google.html', 'wb') as f:
    f.write(download_file(req))
    

-1 votes
I tried all of these solutions and none of them worked for me. Actually, instead of using urllib, what worked was http.client (Python 3):

import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')

This works perfectly every time, whereas with urllib it returned an IncompleteRead exception every time.


-2 votes
I just added one more exception to get past this problem, like this:

import logging
import requests

try:
    r = requests.get(url, timeout=timeout)
except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
    logging.error("There is an error: %s" % e)
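Logging alone drops the response; if you also want to retry before giving up, a minimal sketch (get_with_retries is a hypothetical helper):

import logging
import requests

def get_with_retries(url, timeout=30, attempts=3):
    # Retry the request a few times before giving up.
    for attempt in range(attempts):
        try:
            return requests.get(url, timeout=timeout)
        except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
            logging.error("Attempt %d failed: %s", attempt + 1, e)
    return None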
    