Decompress the first N bytes and last N bytes of a large gzip archive on S3 without downloading and decompressing the whole file

Question

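I have a real use case: using Python to iterate over an S3 bucket and extract the header and footer from large (40 GB) gzip archives without decompressing the whole file.

The code below works as expected, but only for the header: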


import zlib

import boto3


def get_first_line_from_s3(bucket_name, file_name):
    s3 = boto3.client('s3')

    chunk_size = 1024

    # HTTP byte ranges are inclusive, so this requests exactly chunk_size bytes.
    range_header = f"bytes=0-{chunk_size - 1}"
    response = s3.get_object(Bucket=bucket_name, Key=file_name, Range=range_header)

    content = response['Body'].read()

    # MAX_WBITS | 32 auto-detects a gzip or zlib header.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)

    first_line_decompressed = decompressor.decompress(content)

    return first_line_decompressed.decode().split('\n')[0]

But a slight modification to read the trailer throws an error.

Footer code:

def get_last_line_from_s3(bucket_name, file_name):
    s3 = boto3.client('s3')

    chunk_size = 1024

    response = s3.head_object(Bucket=bucket_name, Key=file_name)
    file_size = response['ContentLength']

    # Request the last chunk_size bytes of the object (ranges are inclusive).
    range_header = f'bytes={file_size - chunk_size}-{file_size - 1}'
    response_for_last_bytes = s3.get_object(Bucket=bucket_name, Key=file_name, Range=range_header)
    last_n_bytes = response_for_last_bytes['Body'].read()

    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)

    # This call raises zlib.error: the tail does not begin with a gzip header.
    last_line_decompressed = decompressor.decompress(last_n_bytes)

    print(last_line_decompressed)

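zlib.error: Error -3 while decompressing data: incorrect header check

However, I tried gzip.open(), and that code works for both the header and the footer, but only locally: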

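import gzip


def get_first_and_last_line(path):
    # Open in binary mode: gzip needs raw bytes from the underlying file.
    with open(path, 'rb') as f:
        with gzip.open(f) as f_gzip:
            header_chunk = 1024
            f_gzip.seek(0, 0)
            header_raw = f_gzip.read(header_chunk).decode()
            print(header_raw)

            footer_chunk = 128
            # Seeking relative to the end forces gzip to decompress everything before it.
            f_gzip.seek(-footer_chunk, 2)
            footer_raw = f_gzip.read(footer_chunk).decode()
            print(footer_raw)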

It also works on S3 (after replacing open with s3fs.S3FileSystem().open), but with one small caveat: it decompresses the whole file, which defeats my main goal:

import gzip

import s3fs


def get_last_line_from_s3(s3_path):
    fs = s3fs.S3FileSystem()
    with fs.open(s3_path) as f:
        with gzip.open(f) as f_gzip:
            header_chunk = 1024
            f_gzip.seek(0, 0)
            header_raw = f_gzip.read(header_chunk).decode()
            print(header_raw)

            footer_chunk = 128
            f_gzip.seek(-footer_chunk, 2)
            footer_raw = f_gzip.read(footer_chunk).decode()
            print(footer_raw)

I need to decompress the first N bytes and the last N bytes of a file on S3 without downloading or decompressing the entire file.

Answer 1

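Your "code works for both the header and the footer, but only locally" is also decompressing the entire gzip file in order to print the footer. There is no way to avoid that, at least not without having previously decompressed the entire gzip file and saved entry points near the end, or having compressed the data in a special way to create entry points.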
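As one illustration of "compressing the data in a special way", the footer can be written as its own gzip member, with its byte offset recorded somewhere. This is only a sketch and only helps if you control how the archive is produced. The function names and the idea of storing `footer_offset` alongside the object are assumptions, not something from the question.

```python
import gzip
import zlib


def compress_with_footer_entry_point(body: bytes, footer: bytes):
    """Return (blob, footer_offset) with body and footer as separate gzip members."""
    body_member = gzip.compress(body)
    footer_member = gzip.compress(footer)
    # Concatenated gzip members still form one valid gzip file.
    return body_member + footer_member, len(body_member)


def read_footer(blob_tail: bytes) -> bytes:
    """Decompress bytes that start exactly at a gzip member boundary."""
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect a gzip header
    return decompressor.decompress(blob_tail)


body = b"header\n" + b"row\n" * 1000
blob, footer_offset = compress_with_footer_entry_point(body, b"footer\n")

# On S3, this slice would be a ranged GET starting at footer_offset.
footer = read_footer(blob[footer_offset:])
```

Ordinary tools still see a single file: `gzip.decompress(blob)` returns the body followed by the footer.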

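A normally constructed gzip file can only be decompressed sequentially from the start, so reaching the footer means reading the whole thing.

However, you do not need to download and save the whole thing. A gzip file can be decompressed as a stream. You can read a chunk at a time and decompress as you go.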
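The error in the question can be reproduced locally without S3. The sketch below builds a gzip stream in memory and feeds only its last 1024 bytes to a fresh decompressor. The helper name `try_decompress_tail` and the sample payload are made up for illustration.

```python
import gzip
import zlib


def try_decompress_tail(gz_bytes: bytes, n: int):
    """Return the zlib error message from decompressing the last n bytes, or None."""
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    try:
        decompressor.decompress(gz_bytes[-n:])
    except zlib.error as exc:
        return str(exc)
    return None


# Sample data large enough that the compressed stream exceeds 1024 bytes.
payload = b"".join(b"line %d\n" % i for i in range(50_000))
gz_bytes = gzip.compress(payload)

error_message = try_decompress_tail(gz_bytes, 1024)
print(error_message)
```

The slice starts in the middle of the deflate stream, so the decompressor finds no gzip header there. Even with the header check disabled, the compressed data refers back to earlier output that is not available.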
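A minimal sketch of that streaming approach is below. It decompresses chunk by chunk and keeps only the first `head_size` and the last `tail_size` decompressed bytes. It assumes a single-member gzip file and a positive `tail_size`. The function name and parameters are illustrative.

```python
import gzip
import zlib


def head_and_tail(chunks, head_size=1024, tail_size=128):
    """Stream-decompress gzip chunks, keeping only the first and last bytes."""
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    head = b""
    tail = b""
    for chunk in chunks:
        data = decompressor.decompress(chunk)
        if len(head) < head_size:
            head += data[: head_size - len(head)]
        # Rolling window holding the most recent tail_size bytes.
        tail = (tail + data)[-tail_size:]
    # Collect anything still buffered inside the decompressor.
    data = decompressor.flush()
    if len(head) < head_size:
        head += data[: head_size - len(head)]
    tail = (tail + data)[-tail_size:]
    return head, tail


# In-memory stand-in for a chunked download.
payload = b"HEADER\n" + b"row\n" * 100_000 + b"FOOTER\n"
gz_bytes = gzip.compress(payload)
chunks = (gz_bytes[i : i + 64] for i in range(0, len(gz_bytes), 64))

head, tail = head_and_tail(chunks, head_size=7, tail_size=7)
```

With boto3, the chunks could come from `s3.get_object(Bucket=..., Key=...)['Body'].iter_chunks(chunk_size=...)`. Every compressed byte is still transferred and decompressed, but nothing beyond the current chunk and the two small buffers is kept.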
