Decompressing a large stream with Python tarfile

Question

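I have a large .tar.xz file that I am downloading with Python requests, and it needs to be decompressed before it is written to disk (disk space is limited). I have a solution that works for smaller files, but larger files hang indefinitely.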

import io
import requests
import tarfile
session = requests.Session()
response = session.get(url, stream=True)

compressed_data = io.BytesIO(response.content)
tar = tarfile.open(mode='r|*' ,fileobj=compressed_data, bufsize=16384)
tar.extractall(path='/path/')

For larger files, it hangs at io.BytesIO.

Is there a way to pass the stream to fileobj without reading the entire stream? Or is there a better approach?

Answer 1

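You should use the lzma library to decompress .xz files. Download the large file in chunks (for memory efficiency), decompress each chunk, and write the result to disk. This is the script I use on my server to download a large tar.xz once a week; the file is usually about 6 GB. It should work for you as well.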

import requests
import lzma
import tarfile
import os
import tempfile

url = 'your tar.xz url'

with requests.get(url, stream=True) as response:
    response.raise_for_status()

    # Initialize LZMA decompressor
    decompressor = lzma.LZMADecompressor()

    # Create a temporary file to store the decompressed data
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        for chunk in response.iter_content(chunk_size=32 * 1024):
            data = decompressor.decompress(chunk)
            tmp_file.write(data)

        # Get the name of the temporary file
        tmp_file_name = tmp_file.name

# Now extract from the temporary file
with tarfile.open(tmp_file_name, mode="r") as tar:
    tar.extractall(path="/home/arrafi/")

# Clean up the temporary file
os.remove(tmp_file_name)

Adjust chunk_size=32 * 1024 to suit your needs.

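Now, if you insist on using io, modify your code so that it downloads and decompresses in chunks. Your code hangs because response.content reads the whole download into memory at once, so it runs out of memory. To download large files, you must download them in chunks for memory efficiency.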
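
One way to hand the stream to fileobj without reading it whole, as the question asks, is tarfile's stream mode ('r|xz'), which reads the archive strictly front to back and never seeks. The following is a minimal sketch, not the answerer's script: extract_tar_stream and download_and_extract are hypothetical helper names, and passing response.raw assumes the server sends the file without an additional Content-Encoding, since response.raw does not decode one by default.

```python
import tarfile


def extract_tar_stream(fileobj, dest, compression="xz"):
    """Extract a compressed tar archive from a forward-only file-like object.

    Only fileobj.read(n) is required. Mode 'r|xz' makes tarfile consume the
    archive sequentially in blocks, so neither the compressed nor the
    decompressed archive has to be held in memory or written to disk.
    """
    with tarfile.open(fileobj=fileobj, mode=f"r|{compression}") as tar:
        tar.extractall(path=dest)


def download_and_extract(url, dest):
    """Hypothetical wrapper: stream an HTTP download straight into tarfile."""
    # Imported here so extract_tar_stream stays usable without requests.
    import requests

    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # response.raw is the underlying urllib3 response and exposes read().
        extract_tar_stream(response.raw, dest)
```

Because stream mode cannot move backwards, members can only be processed in the order in which they appear in the archive, which is all extractall needs.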

Answer 2

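Use response.iter_content to stream the tar.xz download in chunks, which an LZMADecompressor incrementally decompresses in memory and writes to a buffer. Once the download has finished, the tar archive held in the buffer is extracted.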

from io import BytesIO
from lzma import LZMADecompressor, FORMAT_XZ
import contextlib
import os
import tarfile
import requests


url = '' #                                 url of the data
chunk_size = 2**14 #                       just an example
path_tar_archive_dir = 'downloaded_tar' #  location of the extraction of the archive


with requests.Session() as session:
    response = session.get(url, stream=True)

    # create buffer to store streamed data
    with BytesIO() as bw:
        # xz-decompressor
        d = LZMADecompressor(format=FORMAT_XZ) # also without arguments

        # read incoming data
        for chunk in response.iter_content(chunk_size=chunk_size):
            # in-memory automatic incremental decompression
            data = d.decompress(chunk)
            if data:
                bw.write(data)
            
        if not d.eof:
            raise Exception('EOF of streaming data not reached')
        
        print('[OK] Download and xz-decompression of stream data')
        # set stream position at the start
        bw.seek(0)
        
        # temporarily silence stdout during extraction
        with open(os.devnull, 'w') as devnull, contextlib.redirect_stdout(devnull):
            # extract the archive from the buffer into the given directory
            with tarfile.open(fileobj=bw, mode='r') as tar:
                tar.extractall(path_tar_archive_dir, numeric_owner=False)

        print('[OK] Extraction of tar archive')
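
The buffer in the code above holds the entire decompressed archive in memory before extraction starts. As a possible alternative, the chunks produced by response.iter_content could be wrapped in a small file-like object and passed to tarfile in stream mode, so that download, decompression and extraction all proceed incrementally. This is only a sketch under that assumption; IterStream and extract_chunks are hypothetical names, not part of requests or tarfile.

```python
import io
import tarfile


class IterStream(io.RawIOBase):
    """Expose an iterator of bytes chunks as a read-only, non-seekable file."""

    def __init__(self, chunks):
        super().__init__()
        self._chunks = iter(chunks)
        self._leftover = b""

    def readable(self):
        return True

    def readinto(self, buffer):
        # Pull the next non-empty chunk once the previous one is used up.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals end of file to the reader
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def extract_chunks(chunks, dest):
    """Extract a tar.xz archive delivered as an iterable of bytes chunks."""
    # 'r|xz' lets tarfile do the xz decompression itself while reading
    # the wrapped iterator strictly sequentially.
    with tarfile.open(fileobj=IterStream(chunks), mode="r|xz") as tar:
        tar.extractall(path=dest)
```

With requests, a call might look like extract_chunks(response.iter_content(chunk_size=chunk_size), path_tar_archive_dir), with response obtained using stream=True.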

Note that, as of Python 3.12, tar.extractall accepts a new keyword-only argument, filter.
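
To illustrate the filter argument, here is a minimal sketch that uses it only when the running interpreter provides it. The helper name safe_extractall is hypothetical; the hasattr(tarfile, 'data_filter') check is the availability test suggested in the tarfile documentation, because extraction filters were also backported to some older releases as security updates.

```python
import tarfile


def safe_extractall(tar, dest):
    """Extract all members, applying the 'data' filter when it is available."""
    if hasattr(tarfile, "data_filter"):
        # The 'data' filter rejects or sanitises risky members such as
        # absolute paths and links that point outside the destination.
        tar.extractall(path=dest, filter="data")
    else:
        # Older interpreters without extraction filters.
        tar.extractall(path=dest)
```

It can be called in place of tar.extractall(...) inside any of the with tarfile.open(...) as tar: blocks above.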

Performance note: tar sends all of its output to stdout; redirecting it to devnull significantly improves extraction performance!

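I set up a Flask server to test the code on localhost. I used a 34 MB tar.xz archive, which takes a while to download completely. Without the performance trick, it needed a lot of RAM, up to 200 MB in my case. With the redirect to devnull, by contrast, the run was barely noticeable in terms of time and RAM.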
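
The figures above are the answerer's own observations. For readers who want to check memory use on their own archive, the following sketch shows one possible way to record the peak of Python-level allocations around a call with the standard tracemalloc module. The helper name peak_memory_mib is hypothetical, and tracemalloc only sees memory allocated through Python's allocators, so allocations made inside native libraries may not be counted.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2**20
```

Wrapping the download-and-extract code in a function and calling it once with and once without the stdout redirect would give two peak values to compare.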

This is my test server.py (for localhost testing only!)

"""
# start the server
$ flask --app server run --debug
"""
from flask import Flask
from flask import send_from_directory
import os


app = Flask(__name__)


@app.route('/archive/', methods=['GET', 'POST'])
def archive():
    # create route http://127.0.0.1:5000/archive/
    
    abs_path_to_archive = ''  # <- add the absolute path to the archive here!
    dir_path, basename = os.path.split(abs_path_to_archive)

    return send_from_directory(dir_path, basename, as_attachment=False)

Then start the server in its own terminal

$ flask --app server run --debug

and, in another terminal, run the program above with url = "http://127.0.0.1:5000/archive/" to download the archive.
