I am using Visual Studio Code with Python 3.12.2 and Beautiful Soup 4.12.3 on Windows 11. The file encoding is set to utf-8.
Here is a sample of my code in VS Code:
import requests
import urllib.parse
from urllib.parse import quote
from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t=' + str(topic) + '&pp=1&page=1'
    print(url)
    html_content = requests.get(url)
    soup = BeautifulSoup(html_content.text, 'html.parser')
    print(url)
This produces the constructed URL with the correct topic number (13717):
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
That is correct and is what I want.
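But here is the problem: I keep getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte".
The problem is that as soon as the statement
html_content = requests.get(url)
is executed, the URL seems to change to: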
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20
I can check this by pasting the constructed URL (https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1) into a web browser. When I press ENTER, the URL changes and the phrase -Buggy-d-%E9tag%E8re-Team-Associated-RC10CC is added.
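As you can see, the characters é and è are replaced by %E9 and %E8 respectively. The result is the UnicodeDecodeError message. The question is:
How can I avoid or error-trap this problem?
Extra info: I do not know beforehand whether the URL will contain special characters.
This is the complete error message: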
PS C:\xampp\htdocs\python> python dumpy.py
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
Traceback (most recent call last):
  File "C:\xampp\htdocs\python\dumpy.py", line 10, in <module>
    html_content = requests.get(url)
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 725, in send
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 175, in resolve_redirects
    url = self.get_redirect_target(resp)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 124, in get_redirect_target
    return to_native_string(location, "utf8")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\_internal_utils.py", line 33, in to_native_string
    out = string.decode(encoding)
          ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte
PS C:\xampp\htdocs\python>
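The position reported in the traceback can be reproduced offline. What follows is only a minimal sketch, assuming the server sends the redirect target in its Location header as raw ISO-8859-1 bytes. The header itself is not shown in the post, so this is an assumption, although it is consistent with byte 0xe9 turning up at position 64. The names redirected, raw_location, error and quoted are made up for illustration.

```python
import string
from urllib.parse import quote, unquote

# The URL the browser ends up on, copied from the question.
redirected = (
    "https://www.scale-rc-car.com/forum/showthread.php"
    "?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20"
)

# Assumption: the Location header carries these characters as raw
# ISO-8859-1 bytes (0xe9 for é, 0xe8 for è) instead of percent-escapes.
raw_location = unquote(redirected, encoding="iso-8859-1").encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails: 0xe9 starts a multi-byte UTF-8
# sequence, but the next byte ("t") is not a continuation byte.
error = None
try:
    raw_location.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding the same bytes as ISO-8859-1 succeeds.
print(raw_location.decode("iso-8859-1"))

# Percent-encoding the raw bytes gives an ASCII-only URL again.
quoted = quote(raw_location, safe=string.punctuation)
print(quoted)
```

Under that assumption, the failure happens while the redirect target is being decoded, before any page content is parsed, so the encoding of the script file and Beautiful Soup are not involved.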
Use urllib.request (an extensible library for opening URLs) to get the redirected URL; see final_url below:
import requests
import urllib.parse
from urllib.parse import quote, unquote
import urllib.request
from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t=' + str(topic) + '&pp=1&page=1'
    print(url)
    with urllib.request.urlopen(url) as cm:
        final_url = cm.geturl()
        print(cm.headers.get_content_charset())  # iso-8859-1
    print(final_url)
    print(unquote(final_url, encoding='iso-8859-1'))
    html_content = requests.get(final_url)
    soup = BeautifulSoup(html_content.text, 'html.parser')
    print(type(soup))
All the print calls are for debugging purposes only.
Output:
.\SO\78094322.py
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
iso-8859-1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1
<class 'bs4.BeautifulSoup'>
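Since it is not known beforehand which topics redirect to a URL with special characters, the same idea can be wrapped in an error trap. The following is a rough sketch, not a definitive implementation: resolve_final_url and fetch_with_fallback are hypothetical helper names, and the fetch function is passed in as a parameter so that the helper itself only needs the standard library.

```python
import urllib.request
from typing import Callable, TypeVar

T = TypeVar("T")


def resolve_final_url(url: str) -> str:
    """Follow redirects with urllib and return the URL it finally reached."""
    with urllib.request.urlopen(url) as response:
        return response.geturl()


def fetch_with_fallback(
    url: str,
    fetch: Callable[[str], T],
    resolve: Callable[[str], str] = resolve_final_url,
) -> T:
    """Call fetch(url); on UnicodeDecodeError, resolve the URL and retry once."""
    try:
        return fetch(url)
    except UnicodeDecodeError:
        # The redirect target could not be decoded as UTF-8.
        # Resolve the final URL separately and fetch that instead.
        return fetch(resolve(url))
```

Inside the loop, this would be called as html_content = fetch_with_fallback(url, requests.get). The extra urllib request is only made for topics that actually raise the error; a second UnicodeDecodeError on the retry is not caught and propagates to the caller.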