Python: the URL changes when requests.get() executes, causing the famous UnicodeDecodeError


I'm using Visual Studio Code with Python 3.12.2 and Beautiful Soup 4.12.3 on Windows 11. The file encoding is set to UTF-8.

Here is my code sample from VS Code:

import requests
import urllib.parse
from urllib.parse import quote

from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = 'https://www.scale-rc-car.com/forum/showthread.php?t='+str(topic) +'&pp=1&page=1'
    print(url)
    html_content = requests.get(url)
    soup = BeautifulSoup(html_content.text, 'html.parser')

    

print(url) produces the constructed URL with the correct topic number (13717):
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
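This is correct and is what I want.

But here is the problem: I keep getting "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte".

The problem is that as soon as the html_content = requests.get(url) statement executes, the URL appears to change to: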
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=20

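I can verify this by pasting the constructed URL (https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1) into a web browser: when I press ENTER, the URL changes and this phrase is added:
-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC
As you can see, the characters é and è are replaced with %E9 and %E8 respectively. The result is the UnicodeDecodeError. The question is: how can I avoid or trap this error? Extra info: I don't know beforehand whether the URL contains special characters.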
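The %E9 and %E8 escapes are what percent-encoding produces when é and è are treated as single ISO-8859-1 bytes rather than as UTF-8. A minimal sketch with the standard library's urllib.parse follows; the variable names are illustrative only and do not come from the original post.

```python
from urllib.parse import quote, unquote

word = "étagère"

# Percent-encode the same text under two different byte encodings.
latin1_escaped = quote(word, encoding="iso-8859-1")
utf8_escaped = quote(word)  # quote() defaults to UTF-8
print(latin1_escaped)  # %E9tag%E8re
print(utf8_escaped)    # %C3%A9tag%C3%A8re

# Decoding only round-trips when the matching encoding is named.
decoded_as_latin1 = unquote(latin1_escaped, encoding="iso-8859-1")  # "étagère"
# The default is UTF-8 with errors="replace", so each stray byte
# becomes U+FFFD instead of raising an exception.
decoded_as_utf8 = unquote(latin1_escaped)
```

So a URL containing %E9 is itself plain ASCII; trouble only starts when something tries to interpret the underlying 0xE9 byte as UTF-8.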

Here is the full error message:

PS C:\xampp\htdocs\python> python dumpy.py
https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
Traceback (most recent call last):
  File "C:\xampp\htdocs\python\dumpy.py", line 10, in <module>
    html_content = requests.get(url)
                   ^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 589, in request     
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 725, in send        
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 175, in resolve_redirects
    url = self.get_redirect_target(resp)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\sessions.py", line 124, in get_redirect_target
    return to_native_string(location, "utf8")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\bartz\AppData\Roaming\Python\Python312\site-packages\requests\_internal_utils.py", line 33, in to_native_string
    out = string.decode(encoding)
          ^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 64: invalid continuation byte
PS C:\xampp\htdocs\python>
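The last frames of the traceback show requests decoding the redirect target as UTF-8 inside get_redirect_target. The sketch below reproduces the same message offline. It assumes (this is an inference from the traceback, not something the post confirms) that the server sends the accented letters in its Location header as raw ISO-8859-1 bytes rather than percent-encoded.

```python
# Redirect target from this thread, with the accented letters unescaped.
target = (
    "https://www.scale-rc-car.com/forum/showthread.php"
    "?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1"
)

# Assumption: these are the bytes the server puts on the wire.
raw_location = target.encode("iso-8859-1")

error = None
try:
    raw_location.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; 'exc' is cleared after the except block
    print(exc)   # message text is plain ASCII

# The same bytes decode cleanly when read as ISO-8859-1.
recovered = raw_location.decode("iso-8859-1")
```

Under that assumption the failing offset lines up with the report: é is character 64 of the redirect URL, and 0xE9 followed by the ASCII letter t is not a valid UTF-8 sequence.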
python url decode encode
1 Answer

Use urllib.request, an extensible library for opening URLs, to resolve the redirect first, then pass the resulting final_url (see below) to requests:

import urllib.request
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

for topic in range(13717, 13718):
    url = f'https://www.scale-rc-car.com/forum/showthread.php?t={topic}&pp=1&page=1'
    print(url)
    # Let urllib follow the redirect; geturl() returns the final,
    # percent-encoded address.
    with urllib.request.urlopen(url) as cm:
        final_url = cm.geturl()
        print(cm.headers.get_content_charset())       # iso-8859-1
    print(final_url)
    # Human-readable form of the redirect target (debugging only).
    print(unquote(final_url, encoding='iso-8859-1'))
    # final_url is plain ASCII, so requests no longer hits the bad redirect.
    html_content = requests.get(final_url)
    soup = BeautifulSoup(html_content.text, 'html.parser')
    print(type(soup))

All the print calls are for debugging purposes only.

Output:

.\SO\78094322.py

https://www.scale-rc-car.com/forum/showthread.php?t=13717&pp=1&page=1
iso-8859-1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-%E9tag%E8re-Team-Associated-RC10CC&pp=1
https://www.scale-rc-car.com/forum/showthread.php?13717-Buggy-d-étagère-Team-Associated-RC10CC&pp=1
<class 'bs4.BeautifulSoup'>
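Since urlopen already follows the redirect and reports the page charset, the body could also be read from that same response instead of requesting the page a second time. This is only a sketch of that variation, not part of the answer above; the helper names decode_body and fetch_html are hypothetical.

```python
import urllib.request
from typing import Optional


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode a response body, falling back to UTF-8 if no charset is declared."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url: str):
    """Return (final_url, html) using a single urllib request."""
    with urllib.request.urlopen(url) as response:
        final_url = response.geturl()                     # URL after redirects
        charset = response.headers.get_content_charset()  # may be None
        return final_url, decode_body(response.read(), charset)
```

The returned html string can be handed to BeautifulSoup(html, 'html.parser') in the same way as html_content.text. errors="replace" is a design choice here: a page with a few mislabelled bytes still parses instead of aborting the loop.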