Unable to read individual pages with PDFTextStripper using multiple threads


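I am able to create 10 threads. The problem arises when I try to access individual pages with those threads in parallel. I also tried putting the private static PDFTextStripper instance inside a synchronized block. I still get the following exception: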

COSStream has been closed and cannot be read. Perhaps its enclosing PDDocument has been closed?

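I am trying to print the first word of each of the first 10 pages, but it does not work. This is my first experiment with multithreading and PDF reading. Any help would be much appreciated.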

public class ReadPDFFile extends Thread implements FileInstance {
    private static String fileLocation;
    private static String fileNameIV;
    private static String userInput;
    private static int userConfidence;
    private static int totalPages;
    private static ConcurrentHashMap<Integer, List<String>> map = null;
    private Iterator<PDDocument> iteratorForThisDoc;
    private PDFTextStripperByArea text;
    private static PDFTextStripper pdfStrip = null;
    private static PDFParser pdParser = null;
    private Splitter splitter;
    private static int counter=0;
    private StringWriter writer;
    private static  ReentrantLock counterLock = new ReentrantLock(true);
    private static PDDocument doc;
    private static PDDocument doc2;
    private static boolean flag = false;
    List<PDDocument> listOfPages;

    ReadPDFFile(String filePath, String fileName, String userSearch, int confidence) throws FileNotFoundException{
        fileLocation= filePath;
        fileNameIV =  fileName;
        userInput= userSearch;
        userConfidence = confidence;
        System.out.println("object created");
    }

    @Override
    public void createFileInstance(String filePath, String fileName) {
        List<String> list = new ArrayList<String>();
        map = new ConcurrentHashMap<Integer, List<String>>();
        try(PDDocument document = PDDocument.load(new File(filePath))){
            doc = document;
            pdfStrip = new PDFTextStripper();
            this.splitter = new Splitter();
            text = new PDFTextStripperByArea();
            document.getClass();
            if(!document.isEncrypted()) {
                totalPages = document.getNumberOfPages();
                System.out.println("Number of pages in this book "+totalPages);
                listOfPages = splitter.split(document);
                iteratorForThisDoc = listOfPages.iterator();
            }
            this.createThreads();
            /*
             * for(int i=0;i<1759;i++) { readThisPage(i, pdfStrip); } flag= true;
             */
        }
        catch(IOException ie) {
            ie.printStackTrace();
        }
    }

    public void createThreads() {
        counter=1;
        for(int i=0;i<=9;i++) {
            ReadPDFFile pdf = new ReadPDFFile();
            pdf.setName("Reader"+i);
            pdf.start();
        }
    }

    public void run() {
        try {
            while(counter < 10){
                int pgNum= pageCounterReentrant();
                readThisPage(pgNum, pdfStrip);
            }
            doc.close();
        }catch(Exception e) {
        }   
        flag= true;
    }

    public static int getCounter() {
        counter=  counter+1;
        return counter;
    }

    public static int pageCounterReentrant() {
        counterLock.lock();
        try {
            counter =  getCounter();
        } finally {
            counterLock.unlock();
        }
        return counter;
    }

    public static void readThisPage(int pageNum, PDFTextStripper ts) {
        counter++;
        System.out.println(Thread.currentThread().getName()+" reading page: "+pageNum+", counter: "+counter);

        synchronized(ts){
            String currentpageContent= new String();
            try {
                ts.setStartPage(pageNum);
                ts.setEndPage(pageNum);
                System.out.println("-->"+ts.getPageEnd());
                currentpageContent = ts.getText(doc);
                currentpageContent = currentpageContent.substring(0, 10);
                System.out.println("\n\n "+currentpageContent);
            }

        /*
         * further operations on currentpageContent here
         */

            catch(IOException io) {
                io.printStackTrace();
            }finally {
            }
        } 
    }

    public static void printFinalResult(ConcurrentHashMap<Integer, List<String>> map) {
        /*
         * simply display content of ConcurrentHashMap
         */
    }

    public static void main(String[] args) throws FileNotFoundException {
        Scanner sc = new Scanner(System.in); 
        System.out.println("Search Word");
        userInput = sc.nextLine(); 
        System.out.println("Confidence"); 
        userConfidence = sc.nextInt(); 
        ReadPDFFile pef = new ReadPDFFile("file path", "file name",userInput, userConfidence);
        pef.createFileInstance("file path ","file name");
        if(flag==true)
            printFinalResult(map);
     }
}

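If I read each page sequentially in a for loop with a single thread, the content is printed, but it does not work with multiple threads. You can see the commented-out code in void createFileInstance(), right after this.createThreads();. I want to use threads to get the string content of each PDF page separately and then operate on it. I already have the logic to collect each word token into a List, but I need to solve this problem before going further.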

java apache pdfbox
1 Answer
try(PDDocument document = PDDocument.load(new File(filePath))){
    doc = document;
    ....
    this.createThreads();
} // document gets closed here
...
// threads that do text extraction still running here (and using a closed object)

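The threads use doc to extract text (ts.getText(doc)). By that time, however, the PDDocument object has already been closed because of the try-with-resources statement, and its streams have been closed along with it. Hence the error message "Perhaps its enclosing PDDocument has been closed?".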

You should create the threads before the document is closed, and wait for all of them to finish before closing it.
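The ordering described above can be sketched as follows. This is a minimal illustration, not the asker's code: FakeDocument is a hypothetical stand-in for PDDocument so that the sketch runs without PDFBox on the classpath, and readPage stands in for the ts.getText(doc) call from the question. The point is only the lifecycle: tasks are submitted inside the try-with-resources block, and the block is not left until every task has finished.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class JoinBeforeClose {

    /** Hypothetical stand-in for PDDocument: reading after close() fails. */
    static final class FakeDocument implements Closeable {
        private volatile boolean closed = false;
        private final int pages;

        FakeDocument(int pages) { this.pages = pages; }

        int getNumberOfPages() { return pages; }

        String readPage(int pageNum) {
            if (closed) {
                throw new IllegalStateException("document already closed");
            }
            return "text of page " + pageNum;
        }

        @Override
        public void close() { closed = true; }
    }

    /** Reads every page on a thread pool and returns page number -> text. */
    static Map<Integer, String> extractAll(int pages, int threads) {
        Map<Integer, String> result = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FakeDocument doc = new FakeDocument(pages)) {
            List<Future<?>> futures = new ArrayList<>();
            for (int p = 1; p <= doc.getNumberOfPages(); p++) {
                final int pageNum = p;
                futures.add(pool.submit(() -> {
                    // Access to the shared document is serialised here.
                    synchronized (doc) {
                        result.put(pageNum, doc.readPage(pageNum));
                    }
                }));
            }
            // Wait for every task BEFORE the try block ends and closes doc.
            for (Future<?> f : futures) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
        return result;
    }
}
```

Calling JoinBeforeClose.extractAll(10, 4) returns a map with one entry per page. If the f.get() loop were moved after the try block, tasks could still be running when the document closes, which is the situation reported in the question.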

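I recommend against multithreading on a single PDDocument; see PDFBOX-4559. You can create several PDDocument objects and extract from those, or skip multithreading altogether. Text extraction in PDFBox is very fast (compared with rendering).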
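The "several PDDocument objects" variant could look roughly like the sketch below. As before, FakeDocument and open() are hypothetical stand-ins so the example runs without PDFBox; with the real library, open() is where each thread would load its own PDDocument from the file. Each worker opens, uses and closes its own document, so nothing is shared between threads except the result map.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class DocumentPerThread {

    /** Hypothetical stand-in for a PDDocument loaded from a file. */
    static final class FakeDocument implements Closeable {
        private final int pages;

        FakeDocument(int pages) { this.pages = pages; }

        int getNumberOfPages() { return pages; }

        String readPage(int pageNum) { return "text of page " + pageNum; }

        @Override
        public void close() { }
    }

    /** Hypothetical loader: every call returns a fresh, independent document. */
    static FakeDocument open() {
        return new FakeDocument(10);
    }

    /** Splits the pages across the given number of threads. */
    static Map<Integer, String> extractAll(int threads) {
        Map<Integer, String> result = new ConcurrentHashMap<>();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int offset = t;
            Thread worker = new Thread(() -> {
                // The document is opened and closed on the thread that uses it.
                try (FakeDocument doc = open()) {
                    // Worker t handles pages t+1, t+1+threads, t+1+2*threads, ...
                    for (int p = offset + 1; p <= doc.getNumberOfPages(); p += threads) {
                        result.put(p, doc.readPage(p));
                    }
                }
            }, "Reader" + t);
            workers.add(worker);
            worker.start();
        }
        // Wait for all workers before handing the results back.
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return result;
    }
}
```

The trade-off is that the file is loaded once per thread, which costs parse time and memory. Since text extraction is already fast, it is worth measuring a plain single-threaded loop first.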
