Python reading a URL: ChunkedEncodingError

Question

I am using the Python requests library to open a URL. The URL points to a Word document. When I visit the URL manually in a browser, the document downloads automatically, and that download succeeds.

However, when using requests, I get a ChunkedEncodingError.

My code:

import requests
url = 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
res = requests.get(url) 
print(res)

The error:

raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(16834 bytes read, 87102 more expected)', IncompleteRead(16834 bytes read, 87102 more expected))

I have also tried other libraries such as aiohttp and urllib3, but they fail with errors as well.

Retrying the request does not help, since I get the error every time.

It would be great if someone could help! Some other posts say this might be a server-side issue, but it works fine in a browser, and the deeper technical details are beyond me.

python python-requests chunked-encoding
1 Answer

This is indeed a server-side issue: it happens even with wget, although wget (and your browser) are smart enough to retry from the byte where the transfer failed:

wget -vvv 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
--2024-01-24 16:05:40--  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Resolving legalref.judiciary.hk (legalref.judiciary.hk)... 118.143.43.114
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103936 (102K) [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                   14%[=================>                                                                                                       ]  15,12K  --.-KB/s    in 0s

2024-01-24 16:05:42 (31,4 MB/s) - Read error at byte 15486/103936 (Connection reset by peer). Retrying.

--2024-01-24 16:05:43--  (try: 2)  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 103936 (102K), 88450 (86K) remaining [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                  100%[++++++++++++++++++======================================================================================================>] 101,50K  25,8KB/s    in 3,3s

2024-01-24 16:05:50 (25,8 KB/s) - ‘CACC000213A_2008.doc’ saved [103936/103936]

You can implement similar logic yourself: use requests.get(..., stream=True), look at the Content-Length you receive, and compare it with the number of bytes you have successfully read. If you hit an exception and have read less than expected (per Content-Length), retry with a Range: bytes={start_byte}- style header:

import requests


def download_with_resume(sess: requests.Session, url: str) -> bytes:
    data = b""
    expected_length = None
    for attempt in range(10):
        if len(data) == expected_length:
            break  # we have the whole file
        if len(data):
            # We already hold a prefix; ask the server for the rest.
            headers = {"Range": f"bytes={len(data)}-"}
            expected_status = 206  # Partial Content
        else:
            headers = {}
            expected_status = 200
        print(f"{url}: got {len(data)} / {expected_length} bytes...")
        resp = sess.get(url, stream=True, headers=headers)
        resp.raise_for_status()
        if resp.status_code != expected_status:
            raise ValueError(f"Unexpected status code: {resp.status_code}")
        if expected_length is None:  # Only read this from the first response
            content_length = resp.headers.get("Content-Length")
            if not content_length:
                raise ValueError("Content-Length header not found")
            expected_length = int(content_length)

        try:
            for chunk in resp.iter_content(chunk_size=8192):
                data += chunk
        except (
            requests.exceptions.ChunkedEncodingError,
            requests.exceptions.ConnectionError,
        ):
            # The connection died mid-body (as in the wget log's
            # "Connection reset by peer"); keep the bytes we already
            # have and retry from the next byte with a Range header.
            pass

    if len(data) != expected_length:
        raise ValueError(f"Expected {expected_length} bytes, got {len(data)}")

    return data


with requests.Session() as sess:
    data = download_with_resume(
        sess,
        url="https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc",
    )
    print("=>", len(data))
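If you want to see the resume logic work end-to-end without depending on the flaky remote server, here is a stdlib-only simulation (it uses urllib.request and http.client.IncompleteRead rather than requests, so it has no third-party dependencies). It runs a local HTTP server that deliberately sends only half the body on the first request, then stitches the file together with a Range request. The server, handler, and helper names here are illustrative, not part of any library:

```python
import http.server
import threading
import urllib.request
from http.client import IncompleteRead

PAYLOAD = bytes(range(256)) * 64  # 16 KiB of test data

class FlakyHandler(http.server.BaseHTTPRequestHandler):
    """Serves PAYLOAD, but truncates any response that starts at byte 0."""

    def do_GET(self):
        rng = self.headers.get("Range")
        start = int(rng.split("=")[1].rstrip("-")) if rng else 0
        body = PAYLOAD[start:]
        if rng:
            self.send_response(206)
            self.send_header(
                "Content-Range", f"bytes {start}-{len(PAYLOAD) - 1}/{len(PAYLOAD)}"
            )
        else:
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if start == 0:
            # Advertise the full length but send only half the body;
            # the connection then closes, mimicking the flaky server.
            self.wfile.write(body[: len(body) // 2])
        else:
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def download_with_resume(url: str, attempts: int = 10) -> bytes:
    data = b""
    expected = None
    for _ in range(attempts):
        headers = {"Range": f"bytes={len(data)}-"} if data else {}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            if expected is None:
                expected = int(resp.headers["Content-Length"])
            try:
                data += resp.read()
            except IncompleteRead as exc:
                data += exc.partial  # keep the prefix, retry with Range
        if len(data) == expected:
            return data
    raise ValueError(f"expected {expected} bytes, got {len(data)}")

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), FlakyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    got = download_with_resume(f"http://127.0.0.1:{server.server_port}/file.doc")
    print("downloaded", len(got), "bytes, intact:", got == PAYLOAD)
finally:
    server.shutdown()
```

The first request ends in an IncompleteRead after half the payload; the second request sends Range: bytes=8192- and receives a 206 with the remainder, which is exactly the dance wget performs in the log above.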