Can't open a Unicode URL with Python

Question

Using Python 2.5.2 on Debian Linux, I am trying to fetch the content of a Spanish URL that contains the Spanish character 'í':

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url).read()

I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)

Before passing the URL to urllib, I tried this:

url = urllib.quote(url)

and also this:

url = url.encode('UTF-8')

but neither of them worked.

Can you tell me what I am doing wrong?

Answer 1

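Per the applicable standard, RFC 1738, URLs can only contain ASCII characters. There is a good explanation here, from which I quote:

"...Only alphanumerics [0-9a-zA-Z], the special characters "$-_.+!*'()," [not including the quotes - ed], and reserved characters used for their reserved purposes may be used unencoded within a URL."

As the URL I gave explains, this probably means you have to replace the "lowercase i with an acute accent" with '%ED'.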
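A minimal sketch of that replacement, assuming Python 3 and its urllib.parse.quote function (the question itself uses Python 2, where the call lives in urllib.quote and behaves differently). Note that '%ED' is the Latin-1 form of the character; when the text is encoded as UTF-8, which quote does by default, the same character becomes '%C3%AD'. Which form a given server expects is not something the thread tells us.

```python
from urllib.parse import quote

char = "\u00ed"  # lowercase i with an acute accent

# quote() encodes to UTF-8 by default before percent-escaping.
utf8_escaped = quote(char)

# Passing encoding="latin-1" gives the single-byte form quoted in the answer.
latin1_escaped = quote(char, encoding="latin-1")

print(utf8_escaped)    # %C3%AD
print(latin1_escaped)  # %ED
```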

Answer 2

This works for me:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# The line above declares the source file encoding; it must be the first or
# second line of the file, see: http://www.python.org/dev/peps/pep-0263/

import urllib
url = u'http://example.com/índice.html'
content = urllib.urlopen(url.encode("UTF-8")).read()

Answer 3

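Encoding the URL as UTF-8 should have worked. I wonder whether your source file is properly encoded, and whether the interpreter knows it. For example, if your Python source file is saved as UTF-8, then you should have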

# coding=UTF-8

as the first or second line.

import urllib
url = u'http://mydomain.es/índice.html'
content = urllib.urlopen(url.encode('utf-8')).read()

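works for me.

Edit: also, be aware that Unicode text in interactive Python sessions (whether through IDLE or a console) is fraught with encoding-related difficulty. In those cases, you should use Unicode escapes (such as \u00ED in your case).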
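For what it is worth, a small sketch of the escape approach, written for Python 3, where every string literal is Unicode (in Python 2 the literal would need the u prefix). Because the escape keeps the source pure ASCII, the result does not depend on the file or terminal encoding:

```python
# The \u00ed escape stands for 'í'; the source itself stays pure ASCII.
url = "http://mydomain.es/\u00edndice.html"

# Encoding to UTF-8 turns that one character into the bytes 0xC3 0xAD.
encoded = url.encode("utf-8")
print(repr(encoded))
```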

Answer 4

This works for me. Make sure you are using a fairly recent version of Python and that your file encoding is correct. Here is my code:

# -*- coding: utf-8 -*-
import urllib
url = u'http://mydomain.es/índice.html'
url = url.encode('utf-8')
content = urllib.urlopen(url).read()

(mydomain.es does not exist, so the DNS lookup fails, but there are no Unicode problems up to that point.)

Answer 5

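I have a similar case right now. I am trying to download images, and I retrieve their URLs from a server in a JSON file. Some of the image URLs contain non-ASCII characters, which throws an error: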

for image in product["images"]: 
    filename = os.path.basename(image) 
    filepath = product_path + "/" + filename 
    urllib.request.urlretrieve(image, filepath) # error!

UnicodeEncodeError: 'ascii' codec can't encode character '\xc7' ...

I tried using .encode("UTF-8"), but can't say it helped:

# coding=UTF-8
import urllib.request
url = u"http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = url.encode("UTF-8")  # url is now bytes, not str
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

This just raises another error:

TypeError: cannot use a string pattern on a bytes-like object

Then I gave urllib.parse.quote(url) a go:

import urllib.parse
import urllib.request
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
url = urllib.parse.quote(url)  # quotes the whole URL, scheme included
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

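Again, this throws another error:

ValueError: unknown url type: 'http%3A//example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png'

The : in "http://..." got escaped as well, and I think that is the cause of the problem.

So I figured out a workaround: I quote/escape only the path, not the whole URL.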
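The escaping of the colon can be reproduced without any network access. A minimal sketch, assuming Python 3: by default quote treats only "/" as safe, so ":" is turned into %3A, while passing a wider safe set leaves the delimiters alone. The safe=":/" choice here is my own illustration and is only adequate for simple URLs like this one, with no query string or port.

```python
from urllib.parse import quote

url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"

# Default: only "/" is kept as-is, so the ":" after the scheme is escaped.
fully_quoted = quote(url)

# Keeping ":" safe as well preserves the "http://" prefix.
kept_delimiters = quote(url, safe=":/")

print(fully_quoted)
print(kept_delimiters)
```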

import urllib.request
import urllib.parse
url = "http://example.com/wp-content/uploads/2018/09/İMAGE-1.png"
parts = urllib.parse.urlparse(url)
# Quote only the path; the scheme and host are left untouched.
url = parts.scheme + "://" + parts.netloc + urllib.parse.quote(parts.path)
urllib.request.urlretrieve(url, r"D:\image-1.jpg")

This is what the URL looks like afterwards: "http://example.com/wp-content/uploads/2018/09/%C4%B0MAGE-1.png", and now I can download the image.
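The same idea can be wrapped in a small helper. This is only a sketch, assuming Python 3; the name quote_url and the choice of safe characters are mine, not part of urllib. Unlike the snippet above, it also carries the query string and fragment through, and it keeps "%" safe so that an already-quoted URL is not escaped twice. Non-ASCII host names are not handled here.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def quote_url(url):
    """Percent-encode non-ASCII characters in the path, query and fragment."""
    parts = urlsplit(url)
    # The scheme and host are passed through unchanged.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

quoted = quote_url("http://example.com/wp-content/uploads/2018/09/İMAGE-1.png")
print(quoted)
```

The returned string is plain ASCII, so it can be handed to urllib.request.urlretrieve in the same way as in the workaround.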
