PDFBox extracts blanks from a PDF encrypted without a password

Question

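I am using PDFBox to extract text from a form, and I have a PDF that is not encrypted with a password, yet PDFBox says it is encrypted. I suspect some Adobe "feature", because when I open it, it says "(SECURED)", while other PDFs that give me no trouble do not. isEncrypted() returns true, so although there is no password, the file appears to be protected in some way.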

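I suspect it is not being decrypted properly, because PDFBox can extract the form's text prompts but not the responses themselves. With the code below, it extracts Address (Street Name and Number) and City from the sample PDF, but not the response between them.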

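I am using PDFBox 2.0, but I have also tried 1.8.

I have tried every PDFBox decryption method I could find, including the deprecated ones (why not). The result is the same as not attempting decryption at all: only the Address and City prompts show up.

Since PDFs are an absolute nightmare, this PDF was quite possibly created in some non-standard way. Any help identifying the problem and getting me moving again is appreciated.

Sample PDF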

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.encryption.StandardDecryptionMaterial;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.io.File;
import org.apache.pdfbox.pdmodel.PDPage;
import java.awt.Rectangle;
import java.util.List;


class Scratch {

    private static float pwidth;
    private static float pheight;

    private static int widthByPercent(double percent) {
        return (int)Math.round(percent * pwidth);
    }

    private static int heightByPercent(double percent) {
        return (int)Math.round(percent * pheight);
    }

    public static void main(String[] args) {
        try {
            //Create objects
            File inputStream = new File("ocr/TestDataFiles/i-9_08-07-09.pdf");

            PDDocument document = PDDocument.load(inputStream);

            // Try every decryption method I've found
            if(document.isEncrypted()) {

                // Method 1 (PDFBox 1.8.x API)
                document.decrypt("");

                // Method 2 (PDFBox 1.8.x API)
                document.openProtection(new StandardDecryptionMaterial(""));

                // Method 3
                document.setAllSecurityToBeRemoved(true);

                System.out.println("Removed encryption");
            }

            PDFTextStripperByArea stripper = new PDFTextStripperByArea();

            //Get the page with data on it
            PDPageTree allPages = document.getDocumentCatalog().getPages();
            PDPage page = allPages.get(3);

            pheight = page.getMediaBox().getHeight();
            pwidth = page.getMediaBox().getWidth();

            Rectangle LastName = new Rectangle(widthByPercent(0.02), heightByPercent(0.195), widthByPercent(0.27), heightByPercent(0.1));
            stripper.addRegion("LastName", LastName);
            stripper.setSortByPosition(true);
            stripper.extractRegions(page);
            List<String> regions = stripper.getRegions();

            System.out.println(stripper.getTextForRegion("LastName"));

        } catch (Exception e){
            System.out.println(e.getMessage());
        }
    }
}

Answer 1

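Bruno's comment explains why the PDF is encrypted even though you do not need to enter a password:

A PDF can be encrypted with two passwords: a user password and an owner password. When a PDF is encrypted with a user password, you cannot open the document in a PDF viewer without entering that password. When a PDF is encrypted with an owner password only, everyone can open it without that password, but some restrictions may be in place. You can recognize PDFs encrypted with an owner password because Adobe Reader shows "SECURED" for them.

Your PDF is encrypted with an owner password only, i.e. the user password is empty. You can therefore decrypt it with the empty password "" in PDFBox versions that provide this method:

document.decrypt("");

(This "Method 1" is exactly the same as your "Method 2"

document.openProtection(new StandardDecryptionMaterial(""));

plus some exception wrapping.)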
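The restrictions of an owner-password-only PDF are stored as bit flags in the /P entry of the encryption dictionary. The following is a minimal sketch of how such flags can be decoded, using only the JDK. The class name PdfPermissionBits and the sample value -1340 are made up for illustration and are not taken from the sample PDF; PDFBox exposes the same information through its AccessPermission class.

```java
/** Decodes a few flags of the /P permissions integer of a PDF encryption dictionary. */
final class PdfPermissionBits {
    // The PDF specification numbers bits from 1 (least significant bit).
    static final int PRINT = 1 << 2;       // bit 3: print the document
    static final int MODIFY = 1 << 3;      // bit 4: modify the contents
    static final int EXTRACT = 1 << 4;     // bit 5: copy or extract text and graphics
    static final int ANNOTATE = 1 << 5;    // bit 6: add or modify annotations, fill in form fields
    static final int FILL_FORMS = 1 << 8;  // bit 9: fill in existing form fields

    private PdfPermissionBits() { }

    /** Returns true if the flag is set, i.e. the operation is permitted. */
    static boolean isAllowed(int permissions, int flag) {
        return (permissions & flag) != 0;
    }

    public static void main(String[] args) {
        int permissions = -1340; // hypothetical /P value, for illustration only
        System.out.println("print:      " + isAllowed(permissions, PRINT));
        System.out.println("modify:     " + isAllowed(permissions, MODIFY));
        System.out.println("extract:    " + isAllowed(permissions, EXTRACT));
        System.out.println("annotate:   " + isAllowed(permissions, ANNOTATE));
        System.out.println("fill forms: " + isAllowed(permissions, FILL_FORMS));
    }
}
```

A viewer that shows "SECURED" is typically reacting to flags like these being cleared. They restrict what a conforming tool lets you do, but they do not by themselves explain missing form values.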

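Tilman's comment hints at why you do not retrieve the form values: your code uses PDFTextStripperByArea for text extraction, but this text extraction only extracts the fixed page content, not the annotations floating on that page. The content you want to extract is the content of form fields, and the widgets of those fields are annotations.

Tilman's proposal

doc.getDocumentCatalog().getAcroForm().getField("form1[0].#subform[3].address[0]").getValueAsString()

shows how to extract the value of a form field whose name you know, in this case "form1[0].#subform[3].address[0]". If you do not know the name of the field you want to extract content from, the PDAcroForm object returned by doc.getDocumentCatalog().getAcroForm() has a number of other methods for accessing field content.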
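When the field names are not known in advance, one option is to walk all fields and print each name with its value. This is a rough sketch assuming the PDFBox 2.0 form API (getFieldTree(), getFullyQualifiedName(), getValueAsString()); the class FormFieldDump and its method readFieldValues are hypothetical helpers, not part of PDFBox.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDTerminalField;

/** Hypothetical helper that collects the values of all terminal AcroForm fields. */
final class FormFieldDump {

    private FormFieldDump() { }

    /** Maps each fully qualified field name to its value, in document order. */
    static Map<String, String> readFieldValues(PDDocument document) {
        Map<String, String> values = new LinkedHashMap<>();
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return values; // the document has no AcroForm at all
        }
        // The field tree iterates over every field, including nested ones.
        for (PDField field : acroForm.getFieldTree()) {
            // Only terminal fields carry a value of their own.
            if (field instanceof PDTerminalField) {
                values.put(field.getFullyQualifiedName(), field.getValueAsString());
            }
        }
        return values;
    }
}
```

With a loaded document, FormFieldDump.readFieldValues(document).forEach((name, value) -> System.out.println(name + " = " + value)) would list the names, which can then be passed to getField(...) as in Tilman's proposal.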

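By the way, a field name like "form1[0].#subform[3].address[0]" in the AcroForm definition points to another peculiarity of this PDF: it actually contains two form definitions, the core PDF AcroForm definition and a more independent XFA definition. Both describe the same visual form. Such a PDF form is called a hybrid PDF form.

The advantage of hybrid forms is that they can be viewed and filled in with PDF tools that only know AcroForm forms (essentially all software other than Adobe's), while PDF tools with XFA support (essentially only Adobe's software) can make use of the additional XFA features.

The disadvantage of hybrid forms is that if they are filled in with a tool that does not support XFA, only the AcroForm information is updated while the XFA information stays unchanged. A hybrid document can therefore contain different data for the same field...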
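To find out whether a document carries an XFA part next to its AcroForm definition, the form dictionary can be checked for an XFA entry. The sketch below assumes the PDFBox 2.0 method PDAcroForm.hasXFA(); the class HybridFormCheck and the method hasXfaPart are hypothetical names chosen for this example.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

/** Hypothetical helper that reports whether a document's form contains an XFA entry. */
final class HybridFormCheck {

    private HybridFormCheck() { }

    /** Returns true if the document has an AcroForm that also carries an XFA entry. */
    static boolean hasXfaPart(PDDocument document) {
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        return acroForm != null && acroForm.hasXFA();
    }
}
```

If this returns true, values read from the AcroForm fields and values stored in the XFA data may disagree, so it is worth knowing which tool last filled in the form.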

Answer 2

PDDocument document = PDDocument.load(file);

// Check if the PDF is encrypted.
if (document.isEncrypted()) {
    // Get the document's access permissions.
    AccessPermission accessPermission = document.getCurrentAccessPermission();

    // Check if the document has content extraction restrictions or owner permissions.
    if (!accessPermission.canExtractContent() || accessPermission.isOwnerPermission()) {
        // The PDF is encrypted and has content extraction restrictions or owner permissions.
        // It is not password-protected.
    } else {
        // The PDF is encrypted and does not have content extraction restrictions or owner permissions.
        // It is password-protected.
    }
}
