How to catch 404 errors in urllib.urlretrieve

Question

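Background: I am using urllib.urlretrieve, as opposed to any other function in the urllib* modules, because of its hook function support (see reporthook below), which I use to display a textual progress bar. This is Python >=2.6.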

>>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])

However, urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g. was it 404 or 200?).

>>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
>>> h.items() 
[('date', 'Thu, 20 Aug 2009 20:07:40 GMT'),
 ('expires', '-1'),
 ('content-type', 'text/html; charset=ISO-8859-1'),
 ('server', 'gws'),
 ('cache-control', 'private, max-age=0')]
>>> h.status
''
>>>

What is the best known way to download a remote HTTP file with hook-like support (to show a progress bar) and decent HTTP error handling?

Answer 1

Check out the complete code of urllib.urlretrieve:

def urlretrieve(url, filename=None, reporthook=None, data=None):
  global _urlopener
  if not _urlopener:
    _urlopener = FancyURLopener()
  return _urlopener.retrieve(url, filename, reporthook, data)

In other words, you can use urllib.FancyURLopener (it is part of the public urllib API). You can override http_error_default to detect 404s:

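import urllib

class MyURLopener(urllib.FancyURLopener):
  def http_error_default(self, url, fp, errcode, errmsg, headers):
    # handle errors the way you'd like to; raising is one option,
    # so that a 404 no longer passes silently
    raise IOError('http error', errcode, errmsg, headers)

fn, h = MyURLopener().retrieve(url, reporthook=my_report_hook)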

Answer 2

You should use:

import urllib2

try:
    resp = urllib2.urlopen("http://www.google.com/this-gives-a-404/")
except urllib2.URLError, e:
    if not hasattr(e, "code"):
        raise
    resp = e

print "Gave", resp.code, resp.msg
print "=" * 80
print resp.read(80)

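Edit: The rationale here is that unless you expect the error status, its occurrence is an exception, and you probably did not even think about it. So instead of letting your code continue to run after an unsuccessful request, the default behavior is, quite sensibly, to stop its execution.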
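For readers on Python 3, the same idea can be sketched with urllib.request and urllib.error. This is only an illustration: the function name open_or_error and its opener parameter are assumptions introduced here (the parameter exists so the function can be exercised without network access), not part of the answer above.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def open_or_error(url, opener=urlopen):
    """Return a response-like object even when the server answers with an error status.

    `opener` is a hypothetical hook for substituting urlopen in tests.
    """
    try:
        return opener(url)
    except HTTPError as e:
        # HTTPError exposes .code and .reason and can be read like a response,
        # so it can stand in for the response object.
        return e


# Usage (needs network access, so it is left commented out):
# resp = open_or_error("http://www.google.com/this-gives-a-404/")
# print("Gave", resp.code)
# print(resp.read(80))
```

Errors that are not HTTP status errors (for example a failed DNS lookup, raised as URLError) still propagate, which matches the re-raise in the Python 2 snippet.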

Answer 3

The "retrieve" method of the URL Opener object supports reporthook and throws an exception on 404.

http://docs.python.org/library/urllib.html#url-opener-objects

Answer 4

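Here is an updated answer for Python 3 (the solution below was tested in Python 3.10):

urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g. was it 404 or 200?).

If the call to urlretrieve did not trigger any exception, the status is probably 200 OK or another 20X code. There may have been redirections beforehand; the behavior is still the same in Python 3.10.
Errors can be caught as HTTPError objects, which contain the error code, so you can determine whether it is a 404 or another error code.
You can add another, general except clause for the other cases: write permissions, invalid destination filename...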

from urllib.error import HTTPError
from urllib.request import urlretrieve

try:
    local_file, headers = urlretrieve("https://google.com/thats/a/404/url", "./output.txt")
except HTTPError as e:
    print("Error, HTTP Code is {}".format(e.code))
    if e.code == 404:
        print("This is a 404 NOT FOUND")
except Exception as e:
    print("Another exception which is not an HTTP error was raised. (Invalid destination filename, inappropriate permissions... etc)")
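Since the question asked for a progress bar as well, the HTTPError handling can be combined with urlretrieve's reporthook argument. The following is a minimal sketch under stated assumptions: the helper names format_progress, report_hook and fetch, and the bar layout, are made up for illustration.

```python
import sys
from urllib.error import HTTPError
from urllib.request import urlretrieve


def format_progress(block_num, block_size, total_size, width=30):
    """Build a one-line textual progress bar from the reporthook arguments."""
    if total_size <= 0:
        # No usable Content-Length was reported, so only a byte count is shown.
        return "{} bytes".format(block_num * block_size)
    # The last block may overshoot the real size, so clamp to total_size.
    done = min(block_num * block_size, total_size)
    filled = width * done // total_size
    percent = 100 * done // total_size
    return "[{}{}] {}%".format("#" * filled, "." * (width - filled), percent)


def report_hook(block_num, block_size, total_size):
    """Hook with the signature urlretrieve expects; redraws the bar in place."""
    sys.stderr.write("\r" + format_progress(block_num, block_size, total_size))
    sys.stderr.flush()


def fetch(url, filename):
    """Download url to filename; return None on success or the HTTP error code."""
    try:
        urlretrieve(url, filename, reporthook=report_hook)
    except HTTPError as e:
        return e.code
    return None


# Usage (needs network access, so it is left commented out):
# if fetch("https://google.com/thats/a/404/url", "./output.txt") == 404:
#     print("This is a 404 NOT FOUND")
```

Keeping the formatting in a separate pure function makes the bar easy to check without performing a download.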

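What is the best known way to download a remote HTTP file with hook-like support (to show a progress bar) and decent HTTP error handling?

If you want a progress bar and better error handling, it is simpler to drop urlretrieve and use requests and tqdm instead: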

pip install tqdm
pip install requests


import requests
from tqdm import tqdm

def download(url: str, fname: str, chunk_size=1024):
    resp = requests.get(url, stream=True)
    total = int(resp.headers.get('content-length', 0))
    with open(fname, 'wb') as file, tqdm(
            desc=fname,
            total=total,
            unit='iB',
            unit_scale=True,
            unit_divisor=1024,
    ) as bar:
        for data in resp.iter_content(chunk_size=chunk_size):
            size = file.write(data)
            bar.update(size)
    return resp.status_code

code = download("https://google.com/not_a_valid_file.txt", "output2.txt")

if code == 404:
    print("The file was not found")

This code is a modified version of https://gist.github.com/yanqd0/c13ed29e29432e3cf3e7c38467f42f51 by GitHub user yanqd0.
