I am trying to fetch some data from a website, but it returns an
incomplete read
error. The data I am trying to get is a huge set of nested links. I did some research online and found that this may be due to a server error (the chunked transfer encoding finishing before the expected size is reached). I also found a workaround for this at this link,
but I am not sure how to apply it to my case. Here is the code I am working on:
import mechanize
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)')]

urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)

for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help me resolve this issue. Thanks.
The link you found is just a wrapper that executes urllib's read() function and catches any incomplete-read exceptions for you. If you don't want to implement the entire patch, you can always add a try/except block wherever you read your links. For example:
import httplib
import urllib2

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead, e:
    page = e.partial
For Python 3:
import http.client
from urllib import request

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
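This workaround can be wrapped in a small reusable function. A minimal Python 3 sketch (the name fetch is mine, not from the question):

```python
import http.client
import urllib.request

def fetch(url):
    # Return the full body, or whatever bytes arrived before the
    # server closed the connection early.
    try:
        return urllib.request.urlopen(url).read()
    except http.client.IncompleteRead as e:
        return e.partial  # the bytes received so far
```

Note that e.partial may be a truncated document, so the caller should be prepared for incomplete HTML.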
What I found in my case: sending an HTTP/1.0 request instead fixes the problem. I add this:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
after I make the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards, I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 can handle chunks, but for some reason the web server doesn't, so we make the request over HTTP/1.0.
For Python 3, it will tell you
ModuleNotFoundError: No module named 'httplib'
In that case, use the http.client module instead, which solves the problem:
import http.client as http
http.HTTPConnection._http_vsn = 10
http.HTTPConnection._http_vsn_str = 'HTTP/1.0'
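If you take this route, it is safer to restore HTTP/1.1 even when the request fails. A sketch using a context manager (force_http_1_0 is a name I made up); note that the version flag is process-wide, so this is not thread-safe:

```python
import http.client
from contextlib import contextmanager

@contextmanager
def force_http_1_0():
    # Temporarily downgrade urllib/http.client to HTTP/1.0,
    # restoring the previous version on exit even on error.
    old_vsn = http.client.HTTPConnection._http_vsn
    old_str = http.client.HTTPConnection._http_vsn_str
    http.client.HTTPConnection._http_vsn = 10
    http.client.HTTPConnection._http_vsn_str = 'HTTP/1.0'
    try:
        yield
    finally:
        http.client.HTTPConnection._http_vsn = old_vsn
        http.client.HTTPConnection._http_vsn_str = old_str
```

Usage: put the urlopen call inside `with force_http_1_0():` so the downgrade only applies to that request.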
try:
    requestObj = urllib.request.urlopen(url, data)
    responseJSON = ""
    while True:
        try:
            responseJSONpart = requestObj.read()
        except http.client.IncompleteRead as icread:
            responseJSON = responseJSON + icread.partial.decode('utf-8')
            continue
        else:
            responseJSON = responseJSON + responseJSONpart.decode('utf-8')
            break
    return json.loads(responseJSON)
except Exception as RESTex:
    print("Exception occurred making REST call: " + RESTex.__str__())
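The read loop above can be factored into a small helper. A sketch (read_with_retries is my name), exercised against a stand-in response object rather than a live server:

```python
import http.client

def read_with_retries(response):
    # Accumulate bytes, keeping the partial data whenever the server
    # truncates a chunk, until a read completes without error.
    data = b''
    while True:
        try:
            chunk = response.read()
        except http.client.IncompleteRead as e:
            data += e.partial
            continue
        return data + chunk
```

As with e.partial above, the accumulated bytes may still fall short of what the server intended to send.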
Use
requests
instead of
urllib2
.
requests
is based on
urllib3
, so it rarely has any problems. Put it in a loop that tries 3 times, and it will be much more robust. You can use it this way:
import sys
import time
import inspect
import requests

msg = None
for i in [1, 2, 3]:
    try:
        r = requests.get(self.crawling, timeout=30)
        msg = r.text
        if msg:
            break
    except Exception as e:
        sys.stderr.write('Got error when requesting URL "' + self.crawling + '": ' + str(e) + '\n')
        if i == 3:
            sys.stderr.write('{0.filename}@{0.lineno}: Failed requesting from URL "{1}" ==> {2}\n'.format(inspect.getframeinfo(inspect.currentframe()), self.crawling, e))
            raise e
    time.sleep(10 * (i - 1))
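The same retry pattern can be pulled out into a reusable function, so it isn't tangled with the crawling code. A sketch (retry_fetch and its parameters are my naming, not from the answer):

```python
import time

def retry_fetch(fetch, attempts=3, base_delay=0.0):
    # Call fetch() up to `attempts` times; back off between tries and
    # re-raise the last error if every attempt fails.
    last_err = None
    for i in range(attempts):
        if i and base_delay:
            time.sleep(base_delay * i)
        try:
            result = fetch()
            if result:
                return result
        except Exception as e:
            last_err = e
    if last_err is not None:
        raise last_err
    return None
```

Usage: `retry_fetch(lambda: requests.get(url, timeout=30).text, attempts=3, base_delay=10)`.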
from urllib import request
import http.client
import os

url = 'http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brand'

try:
    response = request.urlopen(url)
    file = response.read()
except http.client.IncompleteRead as e:
    file = e.partial
except Exception as result:
    print("Unknown error: " + str(result))
    raise SystemExit

# save file
with open(file_path, 'wb') as f:
    print("save -> %s " % file_path)
    f.write(file)
from http.client import IncompleteRead
from http.client import HTTPConnection
import urllib.request

def download_file(request, unsafe=False):
    bytes_ranges_supported = False
    http_1_0_tolerate = False
    return_unsafe = False
    return_raw = b''
    # Check if byte ranges are supported
    try:
        with urllib.request.urlopen(request) as response:
            if response.headers.get('Accept-Ranges') == 'bytes':
                bytes_ranges_supported = True
    except Exception:
        pass
    # Check if HTTP/1.0 is tolerated
    try:
        with urllib.request.urlopen(request) as response:
            if response.status == 206:
                http_1_0_tolerate = True
    except Exception:
        pass
    # Resume mode: re-request the missing tail with a Range header
    if bytes_ranges_supported:
        while True:
            try:
                request.add_header('Range', 'bytes=%d-' % len(return_raw))
                with urllib.request.urlopen(request) as response:
                    return_raw += response.read()
                return return_raw
            except IncompleteRead as e:
                return_raw += e.partial
    # Alternative mode: downgrade to HTTP/1.0
    elif http_1_0_tolerate:
        HTTPConnection._http_vsn = 10
        HTTPConnection._http_vsn_str = 'HTTP/1.0'
    # Unsafe mode: accept a truncated body
    elif unsafe:
        return_unsafe = True
    # Legacy mode
    if return_raw == b'':
        with urllib.request.urlopen(request) as response:
            try:
                return_raw = response.read()
            except IncompleteRead as e:
                if return_unsafe:
                    return_raw = e.partial
                else:
                    raise

    return return_raw

url = 'https://google.com'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'})
with open('google.html', 'wb') as f:
    f.write(download_file(req))
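The Range-resume idea at the heart of download_file can be isolated and exercised without a network by injecting the opener. A sketch under that assumption (resume_read and open_fn are my names; it assumes the server honours Accept-Ranges: bytes):

```python
import http.client
import urllib.request

def resume_read(open_fn, request, data=b''):
    # On each truncation, keep the partial bytes and re-request only
    # the missing tail via a Range header. open_fn is injected so the
    # logic can be tested with a fake opener.
    while True:
        request.add_header('Range', 'bytes=%d-' % len(data))
        try:
            with open_fn(request) as response:
                data += response.read()
            return data
        except http.client.IncompleteRead as e:
            data += e.partial  # keep what arrived, then ask for the rest
```

In production you would pass urllib.request.urlopen as open_fn; add_header replaces the previous Range header on each retry.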
import http.client

conn = http.client.HTTPConnection('www.google.com')
conn.request('GET', '/')
r1 = conn.getresponse()
page = r1.read().decode('utf-8')
This works perfectly every time, whereas with urllib it returned an incomplete read exception every time.
Something like:
import logging
import requests

try:
    r = requests.get(url, timeout=timeout)
except (requests.exceptions.ChunkedEncodingError, requests.ConnectionError) as e:
    logging.error("There is an error: %s" % e)