How to get the page number of an editable field using pdfminer.six

Question (votes: 0, answers: 3)

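I followed the example in this answer to get the editable field values from a PDF document:

How to extract PDF fields from a filled out form in Python?

For each field I get a data structure like the one below, but the list contains every field from every page. How can I determine which page each field is on? In the debugger I tried inspecting the 'AP' and 'P' entries, which are PDFObjRef objects, but that did not lead me anywhere.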

'AP' = {dict: 1} {'N': <PDFObjRef:1947>}
'DA' = {bytes: 23} b'0 0 0 rg /ArialMT 10 Tf'
'F' = {int} 4
'FT' = {PSLiteral} /'Tx'
'M' = {bytes: 23} b"D:20200129121854-06'00'"
'MK' = {dict: 0} {}
'P' = {PDFObjRef} <PDFObjRef:1887>
'Rect' = {list: 4} [36.3844, 28.5617, 254.605, 55.1097]
'StructParent' = {int} 213
'Subtype' = {PSLiteral} /'Widget'
'T' = {bytes: 12} b'CustomerName'
'TU' = {bytes: 13} b'Customer Name'
'Type' = {PSLiteral} /'Annot'
'V' = {bytes: 21} b'Ball-Mart Stores, Inc.'

TIA

python pdf field pdfminer page-numbering
3 Answers
0
votes

I had the same problem; it took me 2 hours of looking at the PDF before I hit on the idea of using page.annots.

It works with PyPDF2.

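doc is initialized beforehand as a PDFDocument (PDFPage.create_pages expects a document object, not an open file):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

doc = PDFDocument(PDFParser(open('sample.pdf', 'rb')))

idtopg = {}  # field name ('T') -> 0-based page index
for pge, page in enumerate(PDFPage.create_pages(doc)):
    # page.annots may be an indirect reference or missing; resolve1 handles both
    for annot in resolve1(page.annots) or []:
        por = resolve1(annot)
        if 'T' in por:  # skip annotations that have no field name
            idtopg[por['T'].decode('utf-8')] = pge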

Now look at your 'T'. The dictionary built here gives you the page for each 'T':

myfieldid = thenameofyourfield['T'].decode('utf-8')
print("The field id {0} is on page {1}".format(myfieldid, idtopg[myfieldid]))
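
Note that a dictionary keyed by 'T' keeps only the last page seen for a given name, while the same field name can show up on more than one page. A variant that collects every page might look like the sketch below. It is deliberately library-independent: index_fields_by_page and its input shape (one list of already-resolved annotation dicts per page) are assumptions made for illustration, not part of pdfminer.

```python
from collections import defaultdict


def index_fields_by_page(pages_annots):
    """Map each field name to the sorted list of 0-based pages it appears on.

    pages_annots: one entry per page, each a list of annotation dicts that
    have already been resolved (hypothetical input shape, not a pdfminer API).
    """
    index = defaultdict(set)
    for page_number, annots in enumerate(pages_annots):
        for annot in annots or []:      # a page may have no annotations
            name = annot.get("T")
            if name is None:            # annotations without 'T' are skipped
                continue
            if isinstance(name, bytes):
                name = name.decode("utf-8", errors="replace")
            index[name].add(page_number)
    return {name: sorted(pages) for name, pages in index.items()}


# Sample data standing in for resolved page annotations.
sample = [
    [{"T": b"CustomerName"}, {"Subtype": "Link"}],
    None,
    [{"T": b"CustomerName"}, {"T": b"Total"}],
]
field_pages = index_fields_by_page(sample)
print(field_pages)
```

With pdfminer, the per-page lists would come from resolving each page's annots, as in the loop above.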

0
votes

I was able to get the page number of each field by doing the following:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

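# PdfUtility.resource_path is presumably a helper from the author's own project;
# any path to the PDF file can be passed to open() instead.
fp = open(PdfUtility.resource_path(filename), 'rb')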
parser = PDFParser(fp)
doc = PDFDocument(parser)
kids = resolve1(doc.catalog['Pages'])['Kids']
page = 0
field_list = []
for kid in kids:
    page += 1
    # Pages without annotations have no 'Annots' entry, so fall back to an empty list
    kid_fields = resolve1(resolve1(kid).get('Annots')) or []
    for i in kid_fields:
        field_dict = {}
        field = resolve1(i)
        name, position = field.get('T'), field.get('Rect')
        if name:
            field_dict['name'] = name.decode('utf-8')
            field_dict['page'] = page
            field_dict['position'] = position
            print(field_dict)
            field_list.append(field_dict)
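
The loop above treats every entry in the root 'Kids' array as a page. A PDF page tree may be nested, so an entry can itself be an intermediate 'Pages' node with its own 'Kids'. A minimal sketch of a depth-first walk that handles this is shown below. iter_leaf_pages is a hypothetical helper, and resolve is a stand-in for pdfminer's resolve1; the default identity function lets the sample run on plain dicts.

```python
def iter_leaf_pages(node, resolve=lambda obj: obj):
    """Yield leaf page dicts in document order.

    A node that has a 'Kids' entry is treated as an intermediate page-tree
    node; anything else is treated as a page. `resolve` should turn an
    indirect reference into the object it points to (passing pdfminer's
    resolve1 here is an assumption of this sketch).
    """
    node = resolve(node)
    kids = resolve(node.get("Kids")) if "Kids" in node else None
    if kids is None:
        yield node
        return
    for kid in kids:
        yield from iter_leaf_pages(kid, resolve)


# Plain-dict stand-in for a nested page tree with three pages.
tree = {
    "Kids": [
        {"Kids": [{"name": "p1"}, {"name": "p2"}]},
        {"name": "p3"},
    ]
}
page_names = [p["name"] for p in iter_leaf_pages(tree)]
print(page_names)
```

With pdfminer this would presumably be called as iter_leaf_pages(doc.catalog['Pages'], resolve1), and wrapping it in enumerate(..., start=1) gives the page numbers.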

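0
votes

The 'P' PDFObjRef attribute in the field data from your example does indeed point to the PDF page object.

That said, this is how I resolve the page number of each AcroForm field in a document using the pdfminer library in my project: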

# Imports
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.utils import decode_text

# Setup
parser = PDFParser(open('somewhere/example.pdf', 'rb'))
doc = PDFDocument(parser)

if "AcroForm" in resolve1(doc.catalog):

    pages = resolve1(doc.catalog['Pages'])
    page_kids = [str(ref) for ref in pages.get('Kids')]
    # print pages example => {'ITXT': b'4.1.6', 'Type': /'Pages', 'Kids': [<PDFObjRef:744>, <PDFObjRef:4>], 'Count': 2}

    fields = [resolve1(f) for f in resolve1(doc.catalog['AcroForm'])['Fields']]
    # print fields example => [
    # {... 'Type': /'Annot', 'T': b'field1page1', 'P': <PDFObjRef:744> ...}
    # ...
    # {... 'Type': /'Annot', 'T': b'field2page2', 'P': <PDFObjRef:4> ...}
    # ]

    # NOTE: child fields would need some further resolving here, but same logic would apply for computing the page number
    for f in fields:
        f_page_ref = f.get('P')

        if f_page_ref is not None:
            f_page_num = page_kids.index(str(f_page_ref)) + 1
            print(f_page_num)  # WHAT YOU'RE LOOKING FOR
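
The NOTE about child fields can be sketched as a recursive walk: a field without its own 'P' may have 'Kids' (widgets or sub-fields) that carry one. The sketch below also compares object ids instead of str(ref). field_pages, FakeRef, and the resolve parameter are hypothetical names used for illustration; the only pdfminer detail assumed is that PDFObjRef exposes an objid attribute and that resolve1 can be passed as resolve.

```python
class FakeRef:
    """Stand-in for an indirect reference; only `objid` and `target` are used."""

    def __init__(self, objid, target=None):
        self.objid = objid
        self.target = target


def resolve(obj):
    """Stand-in for resolve1: follow a FakeRef, pass anything else through."""
    return obj.target if isinstance(obj, FakeRef) else obj


def field_pages(field, page_objids, resolve, prefix=""):
    """Yield (qualified_name, 1-based page number) for a field and its kids."""
    field = resolve(field)
    name = field.get("T")
    if isinstance(name, bytes):
        name = name.decode("utf-8", errors="replace")
    # Kids without their own 'T' inherit the parent's qualified name.
    qualified = ".".join(part for part in (prefix, name) if part)
    page_ref = field.get("P")
    if page_ref is not None and page_ref.objid in page_objids:
        yield qualified, page_objids.index(page_ref.objid) + 1
    for kid in resolve(field.get("Kids")) or []:
        yield from field_pages(kid, page_objids, resolve, qualified)


page_objids = [744, 4]  # object ids of the pages, in page order
parent = {
    "T": b"address",
    "Kids": [
        FakeRef(10, {"T": b"street", "P": FakeRef(744)}),
        FakeRef(11, {"T": b"city", "P": FakeRef(4)}),
    ],
}
result = list(field_pages(parent, page_objids, resolve))
print(result)
```

In the code above, page_objids would correspond to [ref.objid for ref in pages.get('Kids')], under the same assumption that every entry in 'Kids' is a page.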
