Python: error when retrieving the bookmark structure of a PDF

Question

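I am trying to retrieve the bookmark structure of a PDF, ideally including its hierarchy.

I followed this thread: "Read all bookmarks from a PDF document and create a dictionary with the page number and title of each bookmark".

My idea was to use that code as a starting point and then adapt it to my needs.

I managed to get it working, but my code often crashes with the following error: "ValueError: not enough values to unpack (expected 3, got 1)"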

C:\Users\XXXXX\PycharmProjects\pythonProject env\Scripts\python.exe "C:\Google Drive\python\projects\Get bookmarks.py"

Traceback (most recent call last):

File "C:\Google Drive\python\projects\Get bookmarks.py", line 24, in <module> bms = bookmark_dict(reader.outline, use_labels=False)

File "C:\Users\XXXXX\PycharmProjects\pythonProject env\lib\site-packages\pypdf_reader.py", line 844, in outline return self._get_outline()

File "C:\Users\XXXXX\PycharmProjects\pythonProject env\lib\site-packages\pypdf_reader.py", line 880, in _get_outline outline_obj = self._build_outline_item(node)

File "C:\Users\XXXXX\PycharmProjects\pythonProject env\lib\site-packages\pypdf_reader.py", line 1054, in _build_outline_item outline_item = self._build_destination(title, dest)

File "C:\Users\XXXXX\PycharmProjects\pythonProject env\lib\site-packages\pypdf_reader.py", line 1018, in _build_destination return Destination(title, page, Fit(fit_type=typ, fit_args=array)) # type: ignore

File "C:\Users\XXXXX\PycharmProjects\pythonProject env\lib\site-packages\pypdf\generic_data_structs.py", line 1495, in __init__ ( ValueError: not enough values to unpack (expected 3, got 2)

Process finished with exit code 1

Here is the code I am using (taken directly from the thread mentioned above).

from typing import Dict, Union
from pypdf import PdfReader

def bookmark_dict(
        bookmark_list, use_labels: bool = False
) -> Dict[Union[str, int], str]:
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            result.update(bookmark_dict(item, use_labels))
        else:
            page_index = reader.get_destination_page_number(item)
            page_label = reader.page_labels[page_index]
            if use_labels:
                result[page_label] = item.title
            else:
                result[page_index] = item.title
    return result

if __name__ == "__main__":
    folder = "x:\\"
    file = "TestPDF.pdf"
    reader = PdfReader(folder + file)
    bms = bookmark_dict(reader.outline, use_labels=False)
    for page_nb, title in sorted(bms.items(), key=lambda n: f"{str(n[0]):>5}"):
        print(f"{page_nb:>3}: {title}")

The PDF file that gives me the error can be found here: https://easyupload.io/7fsipz

Thank you all!

Answer 1

The file contains many malformed bookmarks:

cpdf -list-bookmarks-json TestPDF.pdf > foo.json
Warning: Could not read destination G [1 0 R/XYZ 820] 
Warning: Could not read destination G [4 0 R/XYZ 837] 
Warning: Could not read destination G [20 0 R/XYZ 827] 
Warning: Could not read destination G [34 0 R/XYZ 817] 
Warning: Could not read destination G [40 0 R/XYZ 813] 
Warning: Could not read destination G [43 0 R/XYZ 831] 
Warning: Could not read destination G [56 0 R/XYZ 826] 
Warning: Could not read destination G [59 0 R/XYZ 804] 
Warning: Could not read destination G [59 0 R/XYZ 608] 
Warning: Could not read destination G [59 0 R/XYZ 529] 
Warning: Could not read destination G [59 0 R/XYZ 608] 
(many more lines of output)
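For context, the PDF specification defines an /XYZ destination with three parameters (left, top, zoom), whereas the destinations in the warnings above carry a single number after /XYZ. The following is a minimal sketch of why strict three-way unpacking fails on such an array, and how a tolerant reader might pad the missing values instead. `parse_xyz_args` is a hypothetical helper written for illustration; it is not part of pypdf or cpdf, and treating the lone number as `left` is an assumption.

```python
from typing import Optional, Sequence, Tuple

XyzArgs = Tuple[Optional[float], Optional[float], Optional[float]]


def parse_xyz_args(args: Sequence[float]) -> XyzArgs:
    """Return (left, top, zoom), padding missing trailing values with None.

    Hypothetical helper: extra values beyond the third are ignored.
    """
    head = list(args[:3])
    padded = head + [None] * (3 - len(head))
    left, top, zoom = padded
    return left, top, zoom


# One number after /XYZ, as in "[1 0 R/XYZ 820]" from the warnings.
malformed = [820]

try:
    # Strict unpacking requires exactly three values and raises otherwise.
    left, top, zoom = malformed
except ValueError as exc:
    print(f"strict unpacking failed: {exc}")

# The tolerant version fills the gaps instead of raising.
print(parse_xyz_args(malformed))
```

A null entry in an /XYZ destination means "leave this value unchanged", so padding with `None` is a reasonable way to represent the missing parameters.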

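The library you are using apparently does not know how to skip them. You can use cpdf as shown above to get a JSON file containing all of the valid bookmarks.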
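Since the question also asks for the hierarchy, here is a rough sketch of turning such a JSON file into a nested structure. It assumes the file is a flat, document-ordered list of objects with `level`, `text`, and `page` keys; check those key names against your own foo.json before relying on them. `build_tree` and the sample data are made up for illustration.

```python
import json
from typing import Any, Dict, List


def build_tree(entries: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Nest a flat bookmark list using each entry's level (0 = top level)."""
    roots: List[Dict[str, Any]] = []
    stack: List[Dict[str, Any]] = []  # chain of ancestors of the current node
    for entry in entries:
        node = {"title": entry["text"], "page": entry["page"], "children": []}
        # Drop everything at this depth or deeper; what remains is the parent chain.
        del stack[entry["level"]:]
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


def print_tree(nodes: List[Dict[str, Any]], depth: int = 0) -> None:
    """Print one bookmark per line, indented by nesting depth."""
    for node in nodes:
        print(f"{'  ' * depth}{node['page']:>3}: {node['title']}")
        print_tree(node["children"], depth + 1)


# Made-up sample standing in for the contents of foo.json.
sample = """[
  {"level": 0, "text": "Chapter 1", "page": 1},
  {"level": 1, "text": "Section 1.1", "page": 2},
  {"level": 0, "text": "Chapter 2", "page": 5}
]"""

tree = build_tree(json.loads(sample))
print_tree(tree)
```

For the real file, replace `json.loads(sample)` with `json.load(f)` on foo.json opened for reading.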
