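I followed the example in this answer to get the editable field values from a PDF document:
For each field I get a data structure like the one below, but the list contains all fields from all pages. How can I determine which page each field is on? In the debugger I tried inspecting the 'AP' and 'P' entries, which are PDFObjRef objects, but that didn't lead me anywhere.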
'AP' = {dict: 1} {'N': <PDFObjRef:1947>}
'DA' = {bytes: 23} b'0 0 0 rg /ArialMT 10 Tf'
'F' = {int} 4
'FT' = {PSLiteral} /'Tx'
'M' = {bytes: 23} b"D:20200129121854-06'00'"
'MK' = {dict: 0} {}
'P' = {PDFObjRef} <PDFObjRef:1887>
'Rect' = {list: 4} [36.3844, 28.5617, 254.605, 55.1097]
'StructParent' = {int} 213
'Subtype' = {PSLiteral} /'Widget'
'T' = {bytes: 12} b'CustomerName'
'TU' = {bytes: 13} b'Customer Name'
'Type' = {PSLiteral} /'Annot'
'V' = {bytes: 21} b'Ball-Mart Stores, Inc.'
TIA
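Same problem here. It took me 2 hours to arrive at the page.annots idea by looking through the PDF.
It works for PyPDF2.
The snippet below uses pdfminer's PDFPage, which expects doc to be a PDFDocument, initialized beforehand with doc = PDFDocument(PDFParser(open('sample.pdf', 'rb'))).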
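from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

# doc must be a pdfminer PDFDocument; create_pages() does not accept a plain file object
idtopg = {}  # field name ('T') -> zero-based page index
pge = 0
for page in PDFPage.create_pages(doc):
    # page.annots may be an indirect reference or None, so resolve it first
    for annot in resolve1(page.annots) or []:
        por = resolve1(annot)
        if 'T' in por:  # skip annotations that are not named fields
            aid = por['T'].decode("utf-8")
            idtopg[aid] = pge
    pge += 1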
Now look up your 'T'. The dictionary built above gives you the page for each 'T':
myfieldid = thenameofyourfield['T'].decode('utf-8')
print("The field id {0} is on page {1}".format(myfieldid, idtopg[myfieldid]))
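One caveat with a name-to-page dictionary like idtopg: if the same 'T' shows up on more than one page (for example a repeated widget), only the last page survives. A minimal sketch of one way around that, assuming you collect (page, name) pairs while looping; pages_by_field is a hypothetical helper name, not part of pdfminer:

```python
from collections import defaultdict


def pages_by_field(pairs):
    """Group page indexes by field name.

    pairs: iterable of (page_index, field_name) tuples, in document order.
    Returns {field_name: [page_index, ...]} without duplicate page entries.
    """
    pages = defaultdict(list)
    for page_index, name in pairs:
        if page_index not in pages[name]:
            pages[name].append(page_index)
    return dict(pages)


# Example input: 'CustomerName' appears on pages 0 and 2
example = [(0, "CustomerName"), (1, "Date"), (2, "CustomerName")]
print(pages_by_field(example))
```

In the loop above you would append (pge, aid) to a list instead of assigning idtopg[aid] = pge, then pass that list to the helper.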
I was able to get the page number of each field by doing the following:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fp = open(PdfUtility.resource_path(filename), 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
# Assumes a flat page tree, i.e. every kid of the root /Pages node is a page
kids = resolve1(doc.catalog['Pages'])['Kids']
page = 0
field_list = []
for kid in kids:
    page += 1
    # Pages without annotations have no 'Annots' entry
    kid_fields = resolve1(resolve1(kid).get('Annots')) or []
    for i in kid_fields:
        field_dict = {}
        field = resolve1(i)
        name, position = field.get('T'), field.get('Rect')
        if name:
            field_dict['name'] = name.decode('utf-8')
            field_dict['page'] = page
            field_dict['position'] = position
            print(field_dict)
            field_list.append(field_dict)
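The loop above treats every entry in the root 'Kids' array as a page. The PDF format also allows intermediate /Pages nodes there, each with its own 'Kids', in which case the page counter drifts. A rough sketch of a recursive walk that handles this, assuming that only intermediate nodes carry a 'Kids' entry; iter_leaf_pages and fields_by_page are hypothetical names, and resolve is whatever dereferences indirect objects (pdfminer's resolve1 in the code above):

```python
def iter_leaf_pages(node, resolve):
    """Yield leaf page dicts in document order, descending into nested /Pages nodes."""
    node = resolve(node)
    kids = node.get("Kids")
    if kids is None:
        # No 'Kids' entry: this node is an actual page
        yield node
        return
    for kid in resolve(kids):
        yield from iter_leaf_pages(kid, resolve)


def fields_by_page(root, resolve):
    """Return [{'name', 'page', 'position'}, ...] with 1-based page numbers."""
    result = []
    for number, page in enumerate(iter_leaf_pages(root, resolve), start=1):
        for annot in resolve(page.get("Annots", [])):
            field = resolve(annot)
            name = field.get("T")
            if name:  # annotations without 'T' are not named fields
                result.append({
                    "name": name.decode("utf-8"),
                    "page": number,
                    "position": field.get("Rect"),
                })
    return result
```

With pdfminer this would be called as fields_by_page(doc.catalog['Pages'], resolve1).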
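The field data you provided in your example includes a 'P' attribute whose PDFObjRef does indeed point to the PDF page object.
That said, here is how I resolve the page number of each AcroForm field in a document using the pdfminer library in my project: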
# Imports
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
from pdfminer.utils import decode_text

# Setup
parser = PDFParser(open('somewhere/example.pdf', 'rb'))
doc = PDFDocument(parser)

if "AcroForm" in resolve1(doc.catalog):
    pages = resolve1(doc.catalog['Pages'])
    page_kids = [str(ref) for ref in pages.get('Kids')]
    # print pages example => {'ITXT': b'4.1.6', 'Type': /'Pages', 'Kids': [<PDFObjRef:744>, <PDFObjRef:4>], 'Count': 2}
    fields = [resolve1(f) for f in resolve1(doc.catalog['AcroForm'])['Fields']]
    # print fields example => [
    #   {... 'Type': /'Annot', 'T': b'field1page1', 'P': <PDFObjRef:744> ...}
    #   ...
    #   {... 'Type': /'Annot', 'T': b'field2page2', 'P': <PDFObjRef:4> ...}
    # ]
    # NOTE: child fields would need some further resolving here, but same logic would apply for computing the page number
    for f in fields:
        f_page_ref = f.get('P')
        if f_page_ref is not None:
            f_page_num = page_kids.index(str(f_page_ref)) + 1
            print(f_page_num)  # WHAT YOU'RE LOOKING FOR
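The NOTE about child fields can be sketched roughly as follows. This is only one possible approach, not the method used in the project above: walk_fields, page_numbers and field_pages_from_pdf are hypothetical names, field names are decoded as UTF-8 as elsewhere in this thread, and page references are compared by the objid attribute of PDFObjRef against the pageid of each PDFPage rather than by string. Joining parent and child 'T' values with a period follows the PDF convention for fully qualified field names.

```python
def walk_fields(field_refs, resolve, prefix=""):
    """Yield (full_name, field_dict) for every field node that has a 'P' entry."""
    for ref in resolve(field_refs):
        field = resolve(ref)
        name = field.get("T")
        if isinstance(name, bytes):
            name = name.decode("utf-8", errors="replace")
        full = f"{prefix}.{name}" if prefix and name else (name or prefix)
        if "P" in field:
            yield full, field
        # Descend into child fields / widgets, if any
        yield from walk_fields(field.get("Kids", []), resolve, full)


def page_numbers(fields, page_ids):
    """Map field names to 1-based page numbers; None if the page is unknown."""
    index = {pid: n for n, pid in enumerate(page_ids, start=1)}
    return {name: index.get(getattr(f["P"], "objid", None)) for name, f in fields}


def field_pages_from_pdf(path):
    # pdfminer is imported here so the two helpers above can be used without it
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        # create_pages() walks the whole page tree, nested nodes included
        page_ids = [page.pageid for page in PDFPage.create_pages(doc)]
        acroform = resolve1(doc.catalog.get("AcroForm")) or {}
        fields = walk_fields(acroform.get("Fields", []), resolve1)
        return page_numbers(fields, page_ids)
```

Calling field_pages_from_pdf('somewhere/example.pdf') would return a dict such as {'field1page1': 1, ...}. Fields without a 'P' entry are skipped, so for those you would still need to fall back to scanning each page's 'Annots'.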