java.io.IOException: Error: End-of-File, expected line with PDFBox


I am trying to read the text of a PDF that is opened in the browser.

After clicking the "Print" button, the following URL opens in a new tab.

https://myappurl.com/employees/2Jb_rpRC710XGvs8xHSOmHE9_LGkL97j/details/listprint.pdf?ids%5B%5D=2Jb_rpRC711lmIvMaBdxnzJj_ZfipcXW

I ran the same program against other URLs and it worked fine. I used the same code as shown here (Extract PDF text).

I am using the following version of PDFBox.

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>1.8.9</version>
</dependency>
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>fontbox</artifactId>
    <version>1.8.9</version>
</dependency>

Below is the code, which works fine with the other URLs:

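import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public boolean verifyPDFContent(String strURL, String reqTextInPDF) {

    PDDocument pdDoc = null;
    COSDocument cosDoc = null;
    String parsedText = null;

    // try-with-resources closes the network stream on every path.
    try (InputStream file = new BufferedInputStream(new URL(strURL).openStream())) {
        PDFParser parser = new PDFParser(file);

        parser.parse();
        cosDoc = parser.getDocument();
        PDFTextStripper pdfStripper = new PDFTextStripper();
        pdfStripper.setStartPage(1);
        pdfStripper.setEndPage(1);

        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);
    } catch (MalformedURLException e2) {
        System.err.println("URL string could not be parsed " + e2.getMessage());
    } catch (IOException e) {
        System.err.println("Unable to open PDF Parser. " + e.getMessage());
    } finally {
        // Release the document after success as well as after failure.
        try {
            if (pdDoc != null)
                pdDoc.close(); // also closes the underlying COSDocument
            else if (cosDoc != null)
                cosDoc.close();
        } catch (IOException e1) {
            e1.printStackTrace();
        }
    }

    System.out.println("+++++++++++++++++");
    System.out.println(parsedText);
    System.out.println("+++++++++++++++++");

    // parsedText is still null when parsing failed, so avoid a NullPointerException.
    return parsedText != null && parsedText.contains(reqTextInPDF);
}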

Below is the stack trace of the exception I get:

java.io.IOException: Error: End-of-File, expected line
at org.apache.pdfbox.pdfparser.BaseParser.readLine(BaseParser.java:1517)
at org.apache.pdfbox.pdfparser.PDFParser.parseHeader(PDFParser.java:372)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:186)
at com.kareo.utils.PDFManager.getPDFContent(PDFManager.java:26)

Update: I captured screenshots while debugging with the URL and with a file. Please help me. Is this something to do with "https"?

java selenium-webdriver pdfbox
1 Answer

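An input stream behaves like a pipe: once the data has flowed through it, it cannot be read again. So you can do one of the following.

1. Save the input stream to a file.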

public void useInputStreamTwiceBySaveToDisk(InputStream inputStream) {
    String desPath = "test001.bin";
    // Copy the stream to disk once; the file can then be reopened as often as needed.
    try (BufferedInputStream is = new BufferedInputStream(inputStream);
         BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream(desPath))) {
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) {
            os.write(buffer, 0, len);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    // Read the saved copy back. Decoding chunk by chunk with the platform
    // charset is only suitable for a quick text dump, not for binary data.
    File file = new File(desPath);
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file))) {
        int len;
        byte[] buffer = new byte[1024];
        while ((len = is.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

2. Read the input stream into a byte array.

public void useInputStreamTwiceSaveToByteArrayOutputStream(InputStream inputStream) {
    // Copy the whole stream into memory once.
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try {
        byte[] buffer = new byte[1024];
        int len;
        while ((len = inputStream.read(buffer)) != -1) {
            outputStream.write(buffer, 0, len);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    // toByteArray() copies the data, so call it once and share the result.
    byte[] data = outputStream.toByteArray();
    // First read (printInputStreamData is a helper defined elsewhere).
    InputStream inputStream1 = new ByteArrayInputStream(data);
    printInputStreamData(inputStream1);
    // Second read: a fresh ByteArrayInputStream over the same bytes.
    InputStream inputStream2 = new ByteArrayInputStream(data);
    printInputStreamData(inputStream2);
}

3. Use mark and reset on the input stream.

public void useInputStreamTwiceByUseMarkAndReset(InputStream inputStream) {
    StringBuilder sb = new StringBuilder();
    try (BufferedInputStream bufferedInputStream = new BufferedInputStream(inputStream)) {
        byte[] buffer = new byte[1024];
        // Mark the current position. The read limit is the maximum int value so that
        // the mark stays valid however much is read; available() is not a reliable
        // limit because it may report fewer bytes than the stream will deliver.
        bufferedInputStream.mark(Integer.MAX_VALUE);
        int len;
        while ((len = bufferedInputStream.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, len));
        }
        System.out.println(sb.toString());
        // After the first pass, rewind the stream to the marked position.
        bufferedInputStream.reset();
        // Second pass over the same data.
        sb = new StringBuilder();
        int len1;
        while ((len1 = bufferedInputStream.read(buffer)) != -1) {
            sb.append(new String(buffer, 0, len1));
        }
        System.out.println(sb.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
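
Approach 2 calls a printInputStreamData helper whose body is not part of the answer. A minimal sketch of what such a helper might look like is given here; the class name StreamDump, the readToString method, and the choice of UTF-8 are all assumptions made for illustration, not the answerer's code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StreamDump {

    // Drains the stream and decodes all bytes in one step, so a multi-byte
    // character is never split across two buffer reads.
    static String readToString(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Hypothetical stand-in for the helper used in approach 2.
    static void printInputStreamData(InputStream in) {
        try {
            System.out.println(readToString(in));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        // Two independent streams over the same bytes can each be read in full.
        printInputStreamData(new ByteArrayInputStream(data));
        printInputStreamData(new ByteArrayInputStream(data));
    }
}
```

Decoding the bytes as text only makes sense for textual data; for a binary file such as a PDF, the byte array itself is what should be passed on.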

After that, you can read the data of the same input stream as many times as you need.
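
Applied to the question, buffering the bytes first also makes it possible to inspect what the server actually returned before handing it to the parser. The stack trace shows the failure in parseHeader, which would be consistent with a response body that is empty or is not a PDF at all (for example a login page, if the request made by url.openStream() does not carry the browser session; this is an assumption, nothing in the question confirms it). The sketch below uses only the JDK; the class name PdfHeaderCheck and its methods are hypothetical, and the check is deliberately strict in requiring "%PDF-" as the very first bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class PdfHeaderCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Reads the whole stream into memory so it can be inspected and then re-read.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int len;
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);
        }
        return out.toByteArray();
    }

    // Returns true only if the data starts with the "%PDF-" marker.
    static boolean looksLikePdf(byte[] data) {
        if (data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = readFully(new ByteArrayInputStream(
                "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII)));
        byte[] html = readFully(new ByteArrayInputStream(
                "<html>Please log in</html>".getBytes(StandardCharsets.US_ASCII)));
        byte[] empty = readFully(new ByteArrayInputStream(new byte[0]));

        System.out.println(looksLikePdf(pdf));   // true
        System.out.println(looksLikePdf(html));  // false
        System.out.println(looksLikePdf(empty)); // false
    }
}
```

If the check passes, the same byte array can be wrapped in a new ByteArrayInputStream and passed to the PDFParser constructor used in the question; if it fails, printing the first bytes or the array length shows what was downloaded instead.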
