Extract specific file extensions from multiple 7-zip files

Question

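I have a RAR file and a ZIP file. Inside each of them there is a folder. Inside the folder there are several 7-zip (.7z) files. Inside every 7z there are multiple files with the same extension but different names.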

RAR or ZIP file
  |___folder
        |_____Multiple 7z
                  |_____Multiple files with same extension and different name

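I want to extract only the files I need out of thousands of files: those whose names contain a certain substring. For example, a file is to be extracted if its name contains '[!]', '(U)' or '(J)'.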

I can extract the folder without any problem, so I have this structure:

folder
   |_____Multiple 7z
                |_____Multiple files with same extension and different name

I'm in a Windows environment, but I have Cygwin installed. How can I extract the files I need painlessly? Maybe with a single command line.

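Update

There are some refinements to the question:

The inner 7z files and the files inside them can have spaces in their names.
Some 7z files contain only a single file, and it does not meet the given criteria. Being the only candidate, it has to be extracted as well.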
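
The selection rule stated in the question and its update can be sketched in Python as follows. This only illustrates the criteria as written; the names `is_wanted` and `select_names` and the sample file names are made up for the example.

```python
# Substrings named in the question as selection criteria.
MARKERS = ("[!]", "(U)", "(J)")


def is_wanted(name):
    """Return True if the file name contains at least one marker."""
    return any(marker in name for marker in MARKERS)


def select_names(names):
    """Pick the wanted names; a lone file in an archive is always taken."""
    wanted = [name for name in names if is_wanted(name)]
    if not wanted and len(names) == 1:
        return list(names)
    return wanted


if __name__ == "__main__":
    # Hypothetical file names, including spaces as allowed by the update.
    print(select_names(["Game (E).bin", "Game (U) [!].bin", "Game (J).bin"]))
    print(select_names(["only file.bin"]))
```

Plain substring tests are enough here, so spaces and brackets in names need no escaping.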

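Solution

Thanks to everyone. The bash solution was the one that helped me out. I was unable to test the Python 3 solutions because I ran into problems installing the libraries with pip. I don't use Python, so I would have to study it to get past the errors I hit with those solutions. For now, I have found a suitable answer.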

Answer 1

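This solution is based on bash, grep and awk; it works on Cygwin and on Ubuntu.

Since you need to search for (X) [!].ext files first and, only if there are no such files, look for (X).ext files, I don't think it is possible to write a single expression that handles this logic.

The solution needs some if/else conditional logic to test the list of files inside the archive and decide which files to extract.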
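
As an illustration of that conditional logic, here is a small Python sketch of the same priority order: files with [!] first, then files with a single capital letter in parentheses directly before the extension, then a lone file. The function name `choose_files` is introduced for this example and is not part of the answer's script.

```python
import re

# "(X)." where X is one capital letter, e.g. "(J).txt".
PLAIN_REGION = re.compile(r"\([A-Z]\)\.")


def choose_files(names):
    """Return the names to extract from one archive, by priority."""
    verified = [name for name in names if "[!]" in name]
    if verified:
        return verified
    plain = [name for name in names if PLAIN_REGION.search(name)]
    if plain:
        return plain
    if len(names) == 1:
        return list(names)
    return []


if __name__ == "__main__":
    print(choose_files(["(E).txt", "(J) [!].txt", "(J).txt", "(U) [!].txt"]))
    print(choose_files(["(J) [b1].txt", "(J) [o1].txt", "(J).txt"]))
    print(choose_files(["test.txt"]))
```

Each step only runs when the previous one found nothing, which is exactly what a single wildcard expression cannot express.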

This is the initial structure inside the zip/rar archive I tested my script on (I made a script to prepare this structure):

folder
├── 7z_1.7z
│   ├── (E).txt
│   ├── (J) [!].txt
│   ├── (J).txt
│   ├── (U) [!].txt
│   └── (U).txt
├── 7z_2.7z
│   ├── (J) [b1].txt
│   ├── (J) [b2].txt
│   ├── (J) [o1].txt
│   └── (J).txt
├── 7z_3.7z
│   ├── (E) [!].txt
│   ├── (J).txt
│   └── (U).txt
└── 7z 4.7z
    └── test.txt

The output is this:

output
├── 7z_1.7z           # This is a folder, not an archive
│   ├── (J) [!].txt   # Here we extracted only files with [!]
│   └── (U) [!].txt
├── 7z_2.7z
│   └── (J).txt       # Here there are no [!] files, so we extracted (J)
├── 7z_3.7z
│   └── (E) [!].txt   # We had here both [!] and (J), extracted only file with [!]
└── 7z 4.7z
    └── test.txt      # We had only one file here, extracted it

And this is the script that does the extraction:

#!/bin/bash

# Remove the output (if it's left from previous runs).
rm -rf output
mkdir -p output

# Unzip the zip archive.
unzip data.zip -d output
# For rar use
#  unrar x data.rar output
# OR
#  7z x -ooutput data.rar

for archive in output/folder/*.7z
do
  # See https://stackoverflow.com/questions/7148604
  # Get the list of file names, remove the extra output of "7z l"
  list=$(7z l "$archive" | awk '
      /----/ {p = ++p % 2; next}
      $NF == "Name" {pos = index($0,"Name")}
      p {print substr($0,pos)}
  ')
  # Get the list of files with [!].
  # -F matches the literal string; without it "[!]" is a bracket expression matching any "!".
  extract_list=$(echo "$list" | grep -F "[!]")
  if [[ -z $extract_list ]]; then
    # If we don't have files with [!], then look for ([A-Z]) pattern
    # to get files with single letter in brackets.
    extract_list=$(echo "$list" | grep "([A-Z])\.")
  fi
  if [[ -z $extract_list ]]; then
    # If we only have one file - extract it.
    # $list is a plain string, so count its lines (${#list[@]} is always 1).
    if [[ $(echo "$list" | wc -l) -eq 1 ]]; then
      extract_list=$list
    fi
  fi
  if [[ ! -z $extract_list ]]; then
    # If we have files to extract, then do the extraction.
    # Output path is output/7zip_archive_name/
    out_path=output/$(basename "$archive")
    mkdir -p "$out_path"
    # -d '\n' (GNU xargs) keeps quotes and backslashes in file names literal.
    echo "$extract_list" | xargs -d '\n' -I {} 7z x -o"$out_path" "$archive" {}
  fi
done

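The basic idea here is to loop over the 7zip archives and get the list of files in each of them with the 7z l command.

The output of that command is quite verbose, so we use awk to clean it up and get the list of file names.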
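
For readers who prefer Python to awk, the same clean-up can be sketched like this: toggle a flag on the dashed separator lines, remember the column where the "Name" header starts, and keep everything from that column on. The function name `parse_7z_listing` is hypothetical, and the sample text only imitates the column layout of a listing; it is not captured from a real run.

```python
def parse_7z_listing(text):
    """Extract file names from the table part of a "7z l" style listing."""
    names = []
    inside_table = False
    name_col = None
    for line in text.splitlines():
        if "----" in line:
            # Separator line: the file table starts or ends here.
            inside_table = not inside_table
            continue
        fields = line.split()
        if fields and fields[-1] == "Name":
            # Header line: remember where the Name column begins.
            name_col = line.index("Name")
        elif inside_table and name_col is not None:
            names.append(line[name_col:])
    return names


# Column layout imitating a "7z l" table; the rows are made-up sample data.
ROW = "{:19} {:5} {:>12} {:>12}  {}"
SAMPLE = "\n".join([
    "Listing archive: 7z_1.7z",
    "",
    ROW.format("   Date      Time", "Attr", "Size", "Compressed", "Name"),
    ROW.format("-" * 19, "-" * 5, "-" * 12, "-" * 12, "-" * 24),
    ROW.format("2017-06-01 10:00:00", "....A", "5", "20", "(J) [!].txt"),
    ROW.format("2017-06-01 10:00:00", "....A", "5", "", "name with spaces.txt"),
    ROW.format("-" * 19, "-" * 5, "-" * 12, "-" * 12, "-" * 24),
    ROW.format("2017-06-01 10:00:00", "", "10", "20", "2 files"),
])

if __name__ == "__main__":
    for name in parse_7z_listing(SAMPLE):
        print(name)
```

Slicing by column rather than splitting on whitespace is what keeps names with spaces intact.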

After that we filter this list with grep to get either the list of [!] files or the list of (X) files. Then we pass this list to 7zip to extract the files we need.
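
The final step, handing the chosen names to 7z, might look like this in Python. This is a sketch that only builds the argument list; `build_extract_command` is a made-up helper, and it passes all names in one call instead of one call per name. It uses the same `x` command and `-o` switch as the script.

```python
import os


def build_extract_command(archive, names, out_root="output"):
    """Build the 7z argument list that extracts `names` from `archive`."""
    # Output path is out_root/<archive file name>/, as in the bash script.
    out_path = os.path.join(out_root, os.path.basename(archive))
    return ["7z", "x", "-o" + out_path, archive] + list(names)


if __name__ == "__main__":
    cmd = build_extract_command("output/folder/7z 4.7z", ["test file.txt"])
    print(cmd)
    # To actually run it: subprocess.run(cmd, check=True)
```

Because the command is a list rather than a shell string, spaces and brackets in archive or file names need no quoting.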

Answer 2

What about using this command line:

7z e c:\myDir\*.7z -oc:\outDir "*(U)*.ext" "*(J)*.ext" "*[!]*.ext" -y

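Where:

myDir is the folder you unpacked to
outDir is your output directory
ext is your file extension

The -y option forces overwriting in case the same file name occurs in different archives.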
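
To see which names those three patterns would pick, here is a rough Python emulation. It assumes that only `*` and `?` act as wildcards and that everything else, including `[!]`, is matched literally and case-sensitively; that assumption should be checked against the 7-Zip documentation. `wildcard_to_regex` and `matches_any` are names invented for this sketch.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a pattern where only * and ? are special into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def matches_any(name, patterns):
    """Return True if the whole name matches at least one pattern."""
    return any(wildcard_to_regex(p).fullmatch(name) for p in patterns)


if __name__ == "__main__":
    patterns = ["*(U)*.txt", "*(J)*.txt", "*[!]*.txt"]
    names = ["(E).txt", "(J) [!].txt", "(J).txt", "(U) [!].txt", "(U).txt"]
    print([n for n in names if matches_any(n, patterns)])
```

Under these assumptions the patterns combine as a plain OR, so (J).txt and (U).txt are selected even when [!] versions exist in the same archive, and an archive whose only file matches none of the patterns yields nothing.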

Answer 3

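You stated in the bounty footer of the question that using Linux is OK. I also don't use Windows; sorry about that. I am using Python 3, and you have to be in a Linux environment (I will try to test this on Windows as soon as I can).

Archive structure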

datadir.rar
          |
          datadir/
                 |
                 zip1.7z
                 zip2.7z
                 zip3.7z
                 zip4.7z
                 zip5.7z

Extracted structure

extracted/
├── zip1
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip2
│   ├── (E) [!].txt
│   ├── (J) [!].txt
│   └── (U) [!].txt
├── zip3
│   ├── (J) [!].txt
│   └── (U) [!].txt
└── zip5
    ├── (J).txt
    └── (U).txt

Here is how I did it.

import libarchive.public
import os, os.path
from os.path import basename
import errno
import rarfile

#========== FILE UTILS =================

#Make directories
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else: raise

#Open "path" for writing, creating any parent directories as needed.
def safe_open_w(path):
    mkdir_p(os.path.dirname(path))
    return open(path, 'wb')

#========== RAR TOOLS ==================

# List
def rar_list(rar_archive):
    with rarfile.RarFile(rar_archive) as rf:
        return rf.namelist()

# extract
def rar_extract(rar_archive, filename, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extract(filename,path)

# extract-all
def rar_extract_all(rar_archive, path):
    with rarfile.RarFile(rar_archive) as rf:
        rf.extractall(path)

#========= 7ZIP TOOLS ==================

# List
def zip7_list(zip7file):
    filelist = []
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            filelist.append(entry.pathname.decode("utf-8"))
    return filelist

# extract
def zip7_extract(zip7file, filename, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if entry.pathname.decode("utf-8") == filename:
                with safe_open_w(os.path.join(path, filename)) as q:
                    for block in entry.get_blocks():
                        q.write(block)
                break

# extract-all
def zip7_extract_all(zip7file, path):
    with open(zip7file, 'rb') as f:
        for entry in libarchive.public.memory_pour(f.read()):
            if os.path.isdir(entry.pathname.decode("utf-8")):
                continue
            with safe_open_w(os.path.join(path, entry.pathname.decode("utf-8"))) as q:
                for block in entry.get_blocks():
                    q.write(block)

#============ FILE FILTER =================

def exclamation_filter(filename):
    return ("[!]" in filename)

def optional_code_filter(filename):
    return not ("[" in filename)

def has_exclamation_files(filelist):
    for singlefile in filelist:
        if(exclamation_filter(singlefile)):
            return True
    return False

#============ MAIN PROGRAM ================

print("-------------------------")
print("Program Started")
print("-------------------------")

BIG_RAR = 'datadir.rar'
TEMP_DIR = 'temp'
EXTRACT_DIR = 'extracted'
newzip7filelist = []

#Extract big rar and get new file list
for zipfilepath in rar_list(BIG_RAR):
    rar_extract(BIG_RAR, zipfilepath, TEMP_DIR)
    newzip7filelist.append(os.path.join(TEMP_DIR, zipfilepath))

print("7z Files Extracted")
print("-------------------------")

for newzip7file in newzip7filelist:
    innerFiles = zip7_list(newzip7file)
    for singleFile in innerFiles:
        fileSelected = False
        if(has_exclamation_files(innerFiles)):
            if exclamation_filter(singleFile): fileSelected = True
        else:
            if optional_code_filter(singleFile): fileSelected = True
        if(fileSelected):
            print(singleFile)
            outputFile = os.path.join(EXTRACT_DIR, os.path.splitext(basename(newzip7file))[0])
            zip7_extract(newzip7file, singleFile, outputFile)

print("-------------------------")
print("Extraction Complete")
print("-------------------------")

Above the main program I have prepared all the required functions. I didn't use all of them, but I kept them in case you need them.
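
The selection done by the main loop can also be isolated into one pure function, which makes it testable without any archive library. This is a sketch that reuses the two filter functions from the listing; `select_inner_files` is a name introduced here, not part of the original program.

```python
def exclamation_filter(filename):
    """True for names that carry the [!] marker."""
    return "[!]" in filename


def optional_code_filter(filename):
    """True for names without any [..] code at all."""
    return "[" not in filename


def select_inner_files(inner_files):
    """Prefer [!] files; otherwise fall back to files without a [..] code."""
    if any(exclamation_filter(name) for name in inner_files):
        return [name for name in inner_files if exclamation_filter(name)]
    return [name for name in inner_files if optional_code_filter(name)]


if __name__ == "__main__":
    print(select_inner_files(["(E).txt", "(J) [!].txt", "(J).txt"]))
    print(select_inner_files(["(J) [b1].txt", "(J).txt", "(U).txt"]))
```

Computing the "has [!] files" check once per archive, rather than once per file as in the loop, gives the same result with less work.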

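I used several Python libraries with Python 3, but you only need to install libarchive and rarfile using pip; the others are built-in libraries.

Here is a copy of my source tree.

Console output

This is the console output when you run this Python file: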

-------------------------
Program Started
-------------------------
7z Files Extracted
-------------------------
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(E) [!].txt
(J) [!].txt
(U) [!].txt
(J).txt
(U).txt
-------------------------
Extraction Complete
-------------------------

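Problems

The only problem I have encountered so far is that some temporary files are generated in the program's root directory. It doesn't affect the program in any way, but I'll try to fix it.

Edit

You have to run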

sudo apt-get install libarchive-dev

to install the actual libarchive program. The Python library is just a wrapper around it. Take a look at the official documentation.
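
Since the Python package only wraps the native library, it can help to check whether the shared library is discoverable before importing the wrapper. Below is a minimal sketch using the standard `ctypes.util.find_library`; whether a particular wrapper uses the same lookup is an assumption, and `find_native_libarchive` is a made-up helper name.

```python
import ctypes.util


def find_native_libarchive():
    """Return the name ctypes would load for libarchive, or None if absent."""
    return ctypes.util.find_library("archive")


if __name__ == "__main__":
    found = find_native_libarchive()
    if found is None:
        print("libarchive shared library not found; install it first")
    else:
        print("libarchive found:", found)
```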

Answer 4

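This is the final version after some tries. The previous one was not useful, so I'm removing it instead of appending to it. Read to the end, since not everything may be needed for the final solution.

On to the topic. I would use Python. If this is a one-time task it may be overkill, but in any other case you can log all steps for future investigation, use regular expressions, and orchestrate commands, including providing their input and taking and processing their output. All of that is quite easy in Python, provided you have it.

Now I'll describe what to do to get the environment configured. Not all of it is mandatory, but my installation attempts involved several steps, and a description of the process may be useful in itself.

I have MinGW, 32-bit version. It is not mandatory for extracting 7zip, however. Once installed, go to C:\MinGW\bin and run mingw-get.exe:

Basic Setup: I have msys-base installed (right click, mark for installation, then Apply Changes from the Installation menu). That way I have bash, sed, grep and many more.
All Packages: there is mingw32-libarchive with dll as its class. Since the Python libarchive package is just a wrapper, you need this dll to actually have a binary to wrap.

The examples are for Python 3. I'm using the 32-bit version. You can fetch it from its home page. I installed it in the default directory, which is a strange location, so my advice is to install it in the root of your disk, like mingw.

Other things: conemu is much better than the default console.

Installing packages in Python: pip is used for that. From your console go to the Python home; there is a Scripts subdirectory there. For me it is c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\Scripts. You can search with, for instance, pip search archive, and install with pip install libarchive-c: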

> pip.exe install libarchive-c
Collecting libarchive-c
  Downloading libarchive_c-2.7-py2.py3-none-any.whl
Installing collected packages: libarchive-c
Successfully installed libarchive-c-2.7

After cd .., start python; the new library can then be imported:

>>> import libarchive
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 27, in <module>
    libarchive = ctypes.cdll.LoadLibrary(libarchive_path)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 426, in LoadLibrary
   return self._dlltype(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
TypeError: LoadLibrary() argument 1 must be str, not None

So it fails. I tried to fix that, and failed with this:

>>> import libarchive
read format "cab" is not supported
read format "7zip" is not supported
read format "rar" is not supported
read format "lha" is not supported
read filter "uu" is not supported
read filter "lzop" is not supported
read filter "grzip" is not supported
read filter "bzip2" is not supported
read filter "rpm" is not supported
read filter "xz" is not supported
read filter "none" is not supported
read filter "compress" is not supported
read filter "all" is not supported
read filter "lzma" is not supported
read filter "lzip" is not supported
read filter "lrzip" is not supported
read filter "gzip" is not supported
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\__init__.py", line 1, in <module>
    from .entry import ArchiveEntry
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\entry.py", line 6, in <module>
    from . import ffi
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 167, in <module>
    c_int, check_int)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\site-packages\libarchive\ffi.py", line 92, in ffi
    f = getattr(libarchive, 'archive_'+name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 361, in __getattr__
    func = self.__getitem__(name)
  File "c:\Users\<<username>>\AppData\Local\Programs\Python\Python36-32\lib\ctypes\__init__.py", line 366, in __getitem__
    func = self._FuncPtr((name_or_ordinal, self))
AttributeError: function 'archive_read_open_filename_w' not found

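I tried the set command and providing this information directly, but that failed too. So I moved on to pylzma, which does not need mingw. The pip install failed: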

> pip.exe install pylzma
Collecting pylzma
  Downloading pylzma-0.4.9.tar.gz (115kB)
    100% |--------------------------------| 122kB 1.3MB/s
Installing collected packages: pylzma
  Running setup.py install for pylzma ... error
    Complete output from command c:\users\texxas\appdata\local\programs\python\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\texxas\\AppData\\Local\\Temp\\pip-build-99t_zgmz\\pylzma\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\texxas\AppData\Local\Temp\pip-ffe3nbwk-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build\lib.win32-3.6
    copying py7zlib.py -> build\lib.win32-3.6
    running build_ext
    adding support for multithreaded compression
    building 'pylzma' extension
    error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

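So it failed again. But this one is easy: I installed Visual Studio Build Tools 2015, and that worked. I have sevenzip installed, so I created a sample archive. So finally I can start python and do: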

from py7zlib import Archive7z
f = open(r"C:\Users\texxas\Desktop\try.7z", 'rb')
a = Archive7z(f)
a.filenames

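And I got an empty list. A closer look explains it: pylzma does not consider empty files, just so you are aware of that. After putting one character into each of my sample files, the last line gives: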

>>> a.filenames
['try/a/test.txt', 'try/a/test1.txt', 'try/a/test2.txt', 'try/a/test3.txt', 'try/a/test4.txt', 'try/a/test5.txt', 'try/a/test6.txt', 'try/a/test7.txt', 'try/b/test.txt', 'try/b/test1.txt', 'try/b/test2.txt', 'try/b/test3.txt', 'try/b/test4.txt', 'try/b/test5.txt', 'try/b/test6.txt', 'try/b/test7.txt', 'try/c/test.txt', 'try/c/test1.txt', 'try/c/test11.txt', 'try/c/test2.txt', 'try/c/test3.txt', 'try/c/test4.txt', 'try/c/test5.txt', 'try/c/test6.txt', 'try/c/test7.txt']

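So... the rest is a piece of cake. Actually, this is part of the original post:

import os
import py7zlib

for folder, subfolders, files in os.walk('.'):
    for file in files:
        if file.endswith('.7z'):
            # A 7z archive: extract the members we need.
            archive_path = os.path.join(folder, file)
            try:
                with open(archive_path, 'rb') as f:
                    arch = py7zlib.Archive7z(f)
                    # getmembers() / read() as provided by pylzma's py7zlib;
                    # verify against your installed version.
                    for member in arch.getmembers():
                        if member.filename.endswith('.py'):
                            dest = os.path.join('./dest', member.filename)
                            os.makedirs(os.path.dirname(dest), exist_ok=True)
                            with open(dest, 'wb') as out:
                                out.write(member.read())
            except py7zlib.FormatError as e:
                print('file ' + archive_path)
                print(str(e))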

As a side note: Anaconda is a great tool, but a full install takes 500+ MB, which is way too much.

Also, let me share the wmctrl.py tool from my github:

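# Excerpt: `active` and `stored` are objects defined elsewhere in wmctrl.py.
from subprocess import getoutput

cmd = 'wmctrl -ir ' + str(active.window) + \
      ' -e 0,' + str(stored.left) + ',' + str(stored.top) + ',' + str(stored.width) + ',' + str(stored.height)
print(cmd)
res = getoutput(cmd)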

That way you can orchestrate different commands, here wmctrl, and the result can be handled in a way that allows data processing.
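
The same pattern, running a command and processing its output, can be sketched with the standard subprocess module. To stay runnable everywhere, the demo calls the Python interpreter itself instead of wmctrl; `run_and_collect` is a hypothetical helper name and the demo output lines are invented.

```python
import subprocess
import sys


def run_and_collect(argv):
    """Run a command and return its standard output as a list of lines."""
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout.splitlines()


if __name__ == "__main__":
    # Stand-in for a real tool: a child Python process printing two lines.
    demo = [sys.executable, "-c", "print('0x01 first'); print('0x02 second')"]
    for line in run_and_collect(demo):
        ident, title = line.split(" ", 1)
        print(ident, "->", title)
```

Passing the command as a list avoids shell quoting problems, and check=True turns a failing command into an exception instead of silently empty output.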
