Python reading a URL: ChunkedEncodingError

Question

I am using the Python requests library to open a URL. The URL points to a Word document. When I visit the URL manually in a browser, the document downloads automatically, and that download succeeds.

However, when using requests, I get a ChunkedEncodingError.

My code:

import requests
url = 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
res = requests.get(url) 
print(res)

The error:

raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(16834 bytes read, 87102 more expected)', IncompleteRead(16834 bytes read, 87102 more expected))

I have also tried other libraries such as aiohttp and urllib3, but they fail with errors as well.

Retrying the request does not help, since I get the error every time.

It would be great if someone could help! Some other posts say this might be a server-side issue, but it works fine in a browser, and the deeper technical details are beyond me.

python python-requests chunked-encoding
1 Answer

This is indeed a server-side issue: it happens even with wget, although wget (and your browser) are smart enough to retry from the byte where the transfer failed:

wget -vvv 'https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc'
--2024-01-24 16:05:40--  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Resolving legalref.judiciary.hk (legalref.judiciary.hk)... 118.143.43.114
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103936 (102K) [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                   14%[=================>                                                                                                       ]  15,12K  --.-KB/s    in 0s

2024-01-24 16:05:42 (31,4 MB/s) - Read error at byte 15486/103936 (Connection reset by peer). Retrying.

--2024-01-24 16:05:43--  (try: 2)  https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc
Connecting to legalref.judiciary.hk (legalref.judiciary.hk)|118.143.43.114|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 103936 (102K), 88450 (86K) remaining [application/msword]
Saving to: ‘CACC000213A_2008.doc’

CACC000213A_2008.doc                                  100%[++++++++++++++++++======================================================================================================>] 101,50K  25,8KB/s    in 3,3s

2024-01-24 16:05:50 (25,8 KB/s) - ‘CACC000213A_2008.doc’ saved [103936/103936]

You can implement similar logic yourself: use requests.get(..., stream=True), look at the Content-Length you receive, and compare it with the number of bytes you have successfully read. If you hit an exception and have read less than expected (per Content-Length), retry with a Range: bytes={start_byte}- style header:

import requests


def download_with_resume(sess: requests.Session, url: str) -> bytes:
    data = b""
    expected_length = None
    for attempt in range(10):
        if len(data) == expected_length:
            break  # we have the whole file
        if len(data):
            # We already hold a prefix; ask the server for the rest.
            headers = {"Range": f"bytes={len(data)}-"}
            expected_status = 206  # Partial Content
        else:
            headers = {}
            expected_status = 200
        print(f"{url}: got {len(data)} / {expected_length} bytes...")
        resp = sess.get(url, stream=True, headers=headers)
        resp.raise_for_status()
        if resp.status_code != expected_status:
            raise ValueError(f"Unexpected status code: {resp.status_code}")
        if expected_length is None:  # Only read this from the first response
            content_length = resp.headers.get("Content-Length")
            if not content_length:
                raise ValueError("Content-Length header not found")
            expected_length = int(content_length)

        try:
            for chunk in resp.iter_content(chunk_size=8192):
                data += chunk
        except (
            requests.exceptions.ChunkedEncodingError,
            requests.exceptions.ConnectionError,
        ):
            # The connection died mid-body (as in the wget log's
            # "Connection reset by peer"); keep the bytes we already
            # have and retry from the next byte with a Range header.
            pass

    if len(data) != expected_length:
        raise ValueError(f"Expected {expected_length} bytes, got {len(data)}")

    return data


with requests.Session() as sess:
    data = download_with_resume(
        sess,
        url="https://legalref.judiciary.hk/doc/judg/word/vetted/other/ch/2008/CACC000213A_2008.doc",
    )
    print("=>", len(data))
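If you want to see the resume logic work end-to-end without depending on the flaky remote server, here is a stdlib-only simulation (it uses urllib.request and http.client.IncompleteRead rather than requests, so it has no third-party dependencies). It runs a local HTTP server that deliberately sends only half the body on the first request, then stitches the file together with a Range request. The server, handler, and helper names here are illustrative, not part of any library:

```python
import http.server
import threading
import urllib.request
from http.client import IncompleteRead

PAYLOAD = bytes(range(256)) * 64  # 16 KiB of test data

class FlakyHandler(http.server.BaseHTTPRequestHandler):
    """Serves PAYLOAD, but truncates any response that starts at byte 0."""

    def do_GET(self):
        rng = self.headers.get("Range")
        start = int(rng.split("=")[1].rstrip("-")) if rng else 0
        body = PAYLOAD[start:]
        if rng:
            self.send_response(206)
            self.send_header(
                "Content-Range", f"bytes {start}-{len(PAYLOAD) - 1}/{len(PAYLOAD)}"
            )
        else:
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        if start == 0:
            # Advertise the full length but send only half the body;
            # the connection then closes, mimicking the flaky server.
            self.wfile.write(body[: len(body) // 2])
        else:
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

def download_with_resume(url: str, attempts: int = 10) -> bytes:
    data = b""
    expected = None
    for _ in range(attempts):
        headers = {"Range": f"bytes={len(data)}-"} if data else {}
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            if expected is None:
                expected = int(resp.headers["Content-Length"])
            try:
                data += resp.read()
            except IncompleteRead as exc:
                data += exc.partial  # keep the prefix, retry with Range
        if len(data) == expected:
            return data
    raise ValueError(f"expected {expected} bytes, got {len(data)}")

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), FlakyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    got = download_with_resume(f"http://127.0.0.1:{server.server_port}/file.doc")
    print("downloaded", len(got), "bytes, intact:", got == PAYLOAD)
finally:
    server.shutdown()
```

The first request ends in an IncompleteRead after half the payload; the second request sends Range: bytes=8192- and receives a 206 with the remainder, which is exactly the dance wget performs in the log above.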