How to remove hyperlinks with PDFBox in Java

Question

I am trying to remove all hyperlinks from a PDF using PDFBox in Java. I only need the plain text.

    public static void main(String[] args) throws IOException {
        File pdfFile = new File("link.pdf");
        List<PDAnnotation> annotations = new ArrayList<>();

        try (PDDocument document = PDDocument.load(pdfFile)) {
            for (PDPage page : document.getPages()) {
                annotations = page.getAnnotations();
                for (int i = 0; i < annotations.size(); i++) {
                    annotations.remove(i);
                }
            }
            document.save(new File("only_fields-removeImproved.pdf"));
        }
    }

This code does not work for me. The input is a PDF with hyperlinks.

Answer 1

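All you are doing is removing elements from your own array. remove does not "delete" the elements from the actual document. I haven't done exactly what you are trying to do, but my search led me to...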

How to delete annotations in PDF file using PDFBox

Once you have the page, try...

page.setAnnotations(null);

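Edit: Hmm, never mind. I tried this based on the answer I linked for you, and it did not work. I just got a copy of the original PDF, with the links still in it. Below is what I tried based on that other answer, although it did not help.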

      try {
        PDDocument document = PDDocument.load(new File("myPdf.pdf"));
        document.setAllSecurityToBeRemoved(true);
        for (PDPage pdPage : document.getPages())
        { 
            pdPage.setAnnotations(null);
        }
        document.save(new File("myPdf.pdf"));
        document.close();
      } catch (IOException ioe) {
          System.out.println(ioe.getMessage());
      }
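
A side note on the loop in the question: whatever the list is backed by, removing by index while the index keeps advancing skips every other element, so roughly half of the entries survive. The following is a minimal plain-Java sketch of that effect. It assumes nothing about PDFBox: strings stand in for annotations, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: plain strings stand in for PDAnnotation objects.
final class RemoveLoopDemo {

    // Same loop shape as in the question: the index advances after each
    // removal, so the element that slides into slot i is never visited.
    static List<String> removeWithAdvancingIndex(List<String> source) {
        List<String> items = new ArrayList<>(source);
        for (int i = 0; i < items.size(); i++) {
            items.remove(i);
        }
        return items;
    }

    // Iterating from the end avoids the skip, because a removal never
    // shifts an element that is still waiting to be visited.
    static List<String> removeFromEnd(List<String> source) {
        List<String> items = new ArrayList<>(source);
        for (int i = items.size() - 1; i >= 0; i--) {
            items.remove(i);
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> annotations = Arrays.asList("link1", "link2", "link3", "link4");
        System.out.println(removeWithAdvancingIndex(annotations)); // [link2, link4]
        System.out.println(removeFromEnd(annotations));            // []
    }
}
```

This only explains why such a loop leaves entries behind. It does not by itself show whether the changed list reaches the saved document.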

Answer 2

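This is how I strip just the text out of a PDF. Be aware that the result can look a little odd, because you are parsing positions in what is essentially a picture rather than human-ordered text. In some fields, the order of things such as double-line and single-line entries can throw you off, and you may need to add custom logic to parse them. Still, by overriding PDFTextStripper you can get an array that contains only the text elements, like this.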

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

/**
* This is an example on how to extract text line by line from pdf document
*/
public class GetLinesFromPDF extends PDFTextStripper {

    private static List<String> lines = new ArrayList<String>();

    public GetLinesFromPDF() throws IOException { }
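    /**
     * Placeholder kept from the original example; getLines() already
     * starts each run with a fresh list.
     */
    public static void resetStaticLinesArray() throws IOException {

    }

    /**
     * @param filePath path of the PDF file to read
     * @return the strings passed to writeString() while the document was processed
     * @throws IOException If there is an error parsing the document.
     */
    public static List<String> getLines( final String filePath ) throws IOException {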
        PDDocument document = null;
        try {
            lines = new ArrayList<>();
            document = PDDocument.load( new File(filePath) );
            PDFTextStripper stripper = new GetLinesFromPDF();
            stripper.setSortByPosition( true );
            stripper.setStartPage( 1 ); // PDFTextStripper page numbers are 1-based
            stripper.setEndPage( document.getNumberOfPages() );
            Writer dummy = new OutputStreamWriter(new ByteArrayOutputStream());
            stripper.writeText(document, dummy);

            // print lines
//            for(String line:lines){
//                System.out.println(line);
//            }
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }
        return lines;
    }

    /**
     * Override the default functionality of PDFTextStripper.writeString()
     * @param str
     * @param textPositions
     * @throws java.io.IOException
     */
    @Override
    protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
        lines.add(str);
        // you may process the line here itself, as and when it is obtained
    }

}

I call it like this, statically...

List<String> lines = GetLinesFromPDF.getLines( filePath );
