Python - Mechanism to identify compressed file type and uncompress them

Question

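A compressed file can be classified into the following logical groups:

a. The operating system you are working on (*ix, Win, etc.).
b. The type of compression algorithm (i.e. .zip, .Z, .bz2, .rar, .gzip), at least from a standard list of the most commonly used compressed file types.
c. Then there is the tarball mechanism, where I suppose there is no compression; it acts more like a concatenation.

Now, if we start addressing the above set of compressed files:

a. Option (a) would be taken care of by Python, since it is a platform-independent language.
b. Options (b) and (c) seem to be a problem.

What I need: how do I identify the file type (compression type) and then uncompress the file?

Like: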

fileType = getFileType(fileName)  
switch(fileType):  
case .rar:  unrar....
case .zip:  unzip....

etc

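So the basic question is: how do we identify the compression algorithm from the file itself (assuming the extension is missing or incorrect)? Is there a specific way to do this in Python?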

Answer 1

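This page has a list of "magic" file signatures. Grab the ones you need and put them in a dict like the one below. Then we need a function that matches the dict keys against the start of the file. I've written a suggestion, though it can be optimized by preprocessing magic_dict into, for example, one giant compiled regular expression.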

magic_dict = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip"
    }

max_len = max(len(x) for x in magic_dict)

def file_type(filename):
    # Open in binary mode so the signature bytes are read unmodified
    with open(filename, "rb") as f:
        file_start = f.read(max_len)
    for magic, filetype in magic_dict.items():
        if file_start.startswith(magic):
            return filetype
    return "no match"

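This solution should be cross-platform and of course does not depend on the file name extension, but it may give false positives for files with arbitrary content that just happen to start with one of these magic byte sequences.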
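The regular-expression optimisation mentioned in this answer could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `MAGIC`, `file_type_from_bytes` and `file_type_re` are made up here, and the three signatures are simply the ones from the dictionary in this answer.

```python
import re

# Signatures copied from the answer's dictionary; the names are illustrative.
MAGIC = {
    b"\x1f\x8b\x08": "gz",
    b"\x42\x5a\x68": "bz2",
    b"\x50\x4b\x03\x04": "zip",
}

# One alternation of escaped signatures. Longer signatures are listed first
# so that a shorter one can never shadow a longer one sharing its prefix.
_PATTERN = re.compile(
    b"|".join(re.escape(m) for m in sorted(MAGIC, key=len, reverse=True))
)
_MAX_LEN = max(len(m) for m in MAGIC)


def file_type_from_bytes(data):
    """Return the type name for the leading bytes, or None if unknown."""
    found = _PATTERN.match(data)  # match() only matches at the start
    return MAGIC[found.group(0)] if found else None


def file_type_re(filename):
    """Read just enough of the file to test every signature."""
    with open(filename, "rb") as f:
        return file_type_from_bytes(f.read(_MAX_LEN))
```

With only three signatures the gain over the plain loop is negligible; a single compiled pattern starts to pay off when the signature table grows large.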

Answer 2

Based on lazyr's answer and my comment, this is what I mean:

class CompressedFile(object):
    magic = None
    file_type = None
    mime_type = None
    proper_extension = None

    def __init__(self, f):
        # f is an open binary file or file-like object
        self.f = f
        self.accessor = self.open()

    @classmethod
    def is_magic(cls, data):
        return data.startswith(cls.magic)

    def open(self):
        return None

import zipfile

class ZIPFile(CompressedFile):
    magic = b'\x50\x4b\x03\x04'
    file_type = 'zip'
    mime_type = 'compressed/zip'

    def open(self):
        return zipfile.ZipFile(self.f)

import bz2

class BZ2File(CompressedFile):
    magic = b'\x42\x5a\x68'
    file_type = 'bz2'
    mime_type = 'compressed/bz2'

    def open(self):
        return bz2.BZ2File(self.f)

import gzip

class GZFile(CompressedFile):
    magic = b'\x1f\x8b\x08'
    file_type = 'gz'
    mime_type = 'compressed/gz'

    def open(self):
        # GzipFile's first positional argument is a filename,
        # so the open file object must be passed as fileobj
        return gzip.GzipFile(fileobj=self.f)


# factory function to create a suitable instance for accessing files
def get_compressed_file(filename):
    # The file is left open on a match because the accessor reads from it
    f = open(filename, 'rb')
    start_of_file = f.read(1024)
    f.seek(0)
    for cls in (ZIPFile, BZ2File, GZFile):
        if cls.is_magic(start_of_file):
            return cls(f)

    f.close()
    return None

filename = 'test.zip'
cf = get_compressed_file(filename)
if cf is not None:
    print(filename, 'is a', cf.mime_type, 'file')
    print(cf.accessor)

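The compressed data can now be accessed through cf.accessor. All of these modules provide similar methods such as read() and write(), though the details differ: zipfile.ZipFile.read(), for instance, takes the name of an archive member.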
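To make the difference between the accessors concrete, here is a small sketch that decompresses in-memory bytes with the standard-library modules. The helper name `read_payload` is hypothetical, and reading only the first member of a zip archive is an arbitrary choice made for the example.

```python
import bz2
import gzip
import io
import zipfile


def read_payload(raw):
    """Decompress `raw` bytes according to their leading signature.

    Hypothetical helper: returns the decompressed bytes, or None when
    the signature is not one of the three handled here.
    """
    buf = io.BytesIO(raw)
    if raw.startswith(b"\x1f\x8b\x08"):
        with gzip.GzipFile(fileobj=buf) as gz:
            return gz.read()
    if raw.startswith(b"\x42\x5a\x68"):
        with bz2.BZ2File(buf) as bz:
            return bz.read()
    if raw.startswith(b"\x50\x4b\x03\x04"):
        with zipfile.ZipFile(buf) as zf:
            names = zf.namelist()
            # A zip file is an archive, so read() needs a member name.
            return zf.read(names[0]) if names else None
    return None
```

gzip and bz2 wrap a single compressed stream, whereas a zip file holds named members, which is why the zip branch has to pick one.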

Answer 3

This is a complex question that depends on a number of factors, the most important being how portable your solution needs to be.

Answer 4

"" is completely false.

Answer 5

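If the aim is only to identify files in order to label them, you have plenty of answers. If you want to uncompress the archive, why not just try it and catch the exceptions/errors? For example: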
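A minimal sketch of this try-and-catch approach, not the answerer's own code: it assumes the data is already in memory and that only gzip and bz2 streams are of interest. The function name `try_decompress` is hypothetical.

```python
import bz2
import gzip
import zlib


def try_decompress(raw):
    """Try each decompressor in turn; return (type, data) or (None, None)."""
    if not raw:
        # gzip.decompress(b"") returns b"" rather than raising,
        # so empty input is rejected up front.
        return None, None
    attempts = (
        # (label, decompress function, exceptions that mean "wrong format")
        ("gz", gzip.decompress, (OSError, EOFError, zlib.error)),
        ("bz2", bz2.decompress, (OSError, ValueError)),
    )
    for name, decompress, errors in attempts:
        try:
            return name, decompress(raw)
        except errors:
            continue
    return None, None
```

For archive formats, the standard library also offers `zipfile.is_zipfile()` and `tarfile.is_tarfile()`, which perform a similar check without extracting anything.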

Answer 6

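2019 update: I was looking for a solution to detect whether a .csv file was gzipped or not. The answer @Lauritz gave threw errors for me; I imagine that is simply because the way files are read has changed over the past 7 years.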
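For the gzip-or-plain CSV case described here, a Python 3 check might look like the sketch below. The key point is opening the file in binary mode and comparing against a bytes literal. The helper names `is_gz_file` and `open_maybe_gzipped` are illustrative, and UTF-8 is an assumed default encoding.

```python
import gzip


def is_gz_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"


def open_maybe_gzipped(path, encoding="utf-8"):
    """Open a text file (e.g. a CSV) for reading, gzipped or not."""
    if is_gz_file(path):
        return gzip.open(path, "rt", encoding=encoding, newline="")
    return open(path, "r", encoding=encoding, newline="")
```

The returned object can be handed to `csv.reader` in either case, since both branches yield a text-mode file opened with `newline=""`.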
