How do I extract files embedded in a docx file with Python?

Question

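I am trying to extract the files embedded in a docx file using Python. I created a simple docx that contains an embedded pdf, an embedded zip, and an embedded docx.

I have been using the code below.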

from pathlib import Path
from zipfile import ZipFile
import pandas as pd

docx_path = Path('junk/test.docx')
docx_unzip_folder = Path('junk')

unzipped_files = []
with ZipFile(docx_path, mode='r') as zip:
    # locate all zip components
    for entry in zip.infolist():
        print(entry.filename)
        # unzip the zip components
        with zip.open(entry.filename, mode='r') as fzip_part:
            # replace the folders separator with underscore for convenience
            docx_unzip_file = docx_unzip_folder/(entry.filename.replace('/','_'))
            with open(docx_unzip_file, 'wb') as output_file:
                output_file.write(fzip_part.read())
            unzipped_file = {'originating docx': docx_path.name,
                             'unzipped file name': docx_unzip_file.name,
                             'unzipped file path': docx_unzip_file,
                             }
            unzipped_files.append(unzipped_file)

unzipped_files = pd.DataFrame(unzipped_files)

It works well, in that it writes out all the embedded files as

word_embeddings_oleObject1.bin
word_embeddings_oleObject2.bin
...

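By processing the information in the various XML parts that come out, I can infer the MIME type of each embedded file and its position in the containing docx document. However, when I try to open the pdf and the zip in Acrobat/gzip, they do not open, even if I change the extension. The embedded Word file opens fine.

Any clues?

PS: Note that the embedded Word files behave better, because the method above saves them with the correct extension and they open normally. The problem is the pdf and zip files.

Many thanks.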

Answer 1

To extract the embedded content from a Word .docx, you can simply unpack it into a working subfolder with the Windows T[ape]ar[chive] (tar) command.
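If you would rather stay in Python than shell out to tar, the same unpacking step can be sketched with the standard `zipfile` module. This is only a minimal illustration: the function name `extract_embeddings` and the output layout are my own assumptions, and it relies on Word storing embedded objects under `word/embeddings/`, which matches the file names shown in the question.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_embeddings(docx_path, out_dir):
    """Copy every part under word/embeddings/ into out_dir (hypothetical helper)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with ZipFile(docx_path) as zf:
        for info in zf.infolist():
            # Skip directory entries and everything that is not an embedded object.
            if info.is_dir() or not info.filename.startswith("word/embeddings/"):
                continue
            target = out_dir / Path(info.filename).name
            target.write_bytes(zf.read(info))
            written.append(target)
    return written
```

Calling `extract_embeddings('junk/test.docx', 'junk/embedded')` would then leave only the `oleObject*.bin` files and any nested docx in the output folder, rather than every part of the package.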

Once extracted, you can use the nested docx directly, for example by opening it in Write.

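The BIN files, however, are actually nested OLE containers, and they cannot be opened directly by Word or Write. Their contents need to be unpacked by a different means. This has been described several times on Stack Overflow, but the only suggestion seems to be that you define your own detection and clip the payload out of the container, for example see https://stackoverflow.com/a/71106113/10802527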
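A quick way to confirm this for your own files is to look at the first few bytes of each extracted part. The sketch below is an illustration only; `sniff_container` is a hypothetical helper, and it checks just two well-known signatures: the 8-byte OLE compound file header and the ZIP local file header that a nested docx starts with.

```python
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # OLE compound file signature
ZIP_MAGIC = b"PK\x03\x04"                      # ZIP local file header (docx, zip)


def sniff_container(data: bytes) -> str:
    """Classify a blob by its leading bytes (hypothetical helper)."""
    if data.startswith(OLE_MAGIC):
        return "ole"
    if data.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"
```

An `oleObject1.bin` that reports `"ole"` is a wrapper around the real file, which is why renaming it to `.pdf` or `.zip` does not help.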

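Note that the zip and the PDF must be handled as pure binary, with no byte encoding applied, and cut out from between the Head and Tail values that mark where the payload starts and ends.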
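As a rough sketch of that Head/Tail idea in Python, the functions below search the raw bytes for the payload's own start and end markers and slice between them. The names `carve_pdf` and `carve_zip` are hypothetical, and the approach is a heuristic rather than a real OLE parser: it assumes the payload sits in the container as one contiguous run of bytes, which the OLE format does not guarantee. For the zip, the end is taken from the end-of-central-directory record (22 bytes plus an optional comment).

```python
import struct

PDF_HEAD = b"%PDF-"
PDF_TAIL = b"%%EOF"
ZIP_HEAD = b"PK\x03\x04"  # first local file header
ZIP_EOCD = b"PK\x05\x06"  # end-of-central-directory record


def carve_pdf(blob: bytes):
    """Return the bytes from the first %PDF- to the last %%EOF, or None."""
    start = blob.find(PDF_HEAD)
    end = blob.rfind(PDF_TAIL)  # last marker, since updated PDFs carry several
    if start == -1 or end < start:
        return None
    return blob[start:end + len(PDF_TAIL)]


def carve_zip(blob: bytes):
    """Return the bytes from the first local header through the EOCD record, or None."""
    start = blob.find(ZIP_HEAD)
    eocd = blob.rfind(ZIP_EOCD)
    if start == -1 or eocd < start or eocd + 22 > len(blob):
        return None
    # The comment length is a little-endian uint16 at offset 20 of the EOCD record.
    (comment_len,) = struct.unpack_from("<H", blob, eocd + 20)
    return blob[start:eocd + 22 + comment_len]
```

Read the `.bin` with `Path(...).read_bytes()` and write the result with `write_bytes()`, so nothing is ever decoded as text. If the carved file still fails to open, the payload is probably fragmented inside the container, and a proper compound-file reader (the third-party `olefile` package is one option) would be the more reliable route.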
