How can I search a PDF document in Java for specific strings or words and get their coordinates?

Question (votes: 0, answers: 4)

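I am using PDFBox to search for a word (or string) in a PDF file, and I also want to know the coordinates of that word. For example, the PDF file contains a string like "${abc}", and I want to know the coordinates of this string. I tried a few examples but did not get the result I described: the output shows the coordinates of individual characters instead.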

Here is the code:

@Override
protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    for (TextPosition text : textPositions) {
        // Prints one line per character: position, font size, x scale, height, and widths
        System.out.println("String[" + text.getXDirAdj() + "," +
                text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale=" +
                text.getXScale() + " height=" + text.getHeightDir() + " space=" +
                text.getWidthOfSpace() + " width=" +
                text.getWidthDirAdj() + "]" + text.getUnicode());
    }
}

I am using PDFBox 2.0.

java pdfbox
4 Answers

12 votes

The last method in which PDFBox's PDFTextStripper class still has the text together with its positions (before it is reduced to plain text) is

/**
 * Write a Java string to the output stream. The default implementation will ignore the <code>textPositions</code>
 * and just calls {@link #writeString(String)}.
 *
 * @param text The text to write to the stream.
 * @param textPositions The TextPositions belonging to the text.
 * @throws IOException If there is an error when writing the text.
 */
protected void writeString(String text, List<TextPosition> textPositions) throws IOException

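This is the place to intercept, because this method receives pre-processed and, in particular, sorted TextPosition objects (provided sorting was requested in the first place).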

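(Actually I would have preferred to intercept in the calling method writeLine which, judging by the names of its parameters and local variables, has all the TextPosition instances of a line and calls writeString once per word; unfortunately, though, the PDFBox developers have declared this method private... well, maybe this will change before the final 2.0.0 release... nudge, nudge. Update: unfortunately it has not changed in the release... sigh)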

Furthermore, wrapping sequences of TextPosition instances in a String-like helper class makes the code clearer.

With this in mind, one can search for the variables like this:

List<TextPositionSequence> findSubwords(PDDocument document, int page, String searchTerm) throws IOException
{
    final List<TextPositionSequence> hits = new ArrayList<TextPositionSequence>();
    PDFTextStripper stripper = new PDFTextStripper()
    {
        @Override
        protected void writeString(String text, List<TextPosition> textPositions) throws IOException
        {
            TextPositionSequence word = new TextPositionSequence(textPositions);
            String string = word.toString();

            int fromIndex = 0;
            int index;
            while ((index = string.indexOf(searchTerm, fromIndex)) > -1)
            {
                hits.add(word.subSequence(index, index + searchTerm.length()));
                fromIndex = index + 1;
            }
            super.writeString(text, textPositions);
        }
    };
    
    stripper.setSortByPosition(true);
    stripper.setStartPage(page);
    stripper.setEndPage(page);
    stripper.getText(document);
    return hits;
}
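The loop in writeString advances fromIndex by one character after each hit rather than by the length of the search term, so overlapping occurrences inside one chunk are reported too. Here is a minimal sketch of just that loop in plain Java, without PDFBox. The class name OverlappingSearch and the method findAll are made up for this illustration, and the guard against an empty term is an addition that findSubwords does not have.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of PDFBox.
final class OverlappingSearch {

    // Returns the start index of every occurrence of term in text,
    // including occurrences that overlap an earlier one.
    static List<Integer> findAll(String text, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            // indexOf("", n) never returns -1, so the loop below would not end.
            return starts;
        }
        int fromIndex = 0;
        int index;
        while ((index = text.indexOf(term, fromIndex)) > -1) {
            starts.add(index);
            fromIndex = index + 1; // step by one character, as in findSubwords
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(findAll("a ${var1} b ${var1}", "${var1}"));
        System.out.println(findAll("aaaa", "aa"));
    }
}
```

If overlapping hits are not wanted, stepping by term.length() instead of 1 would be the obvious variation.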

The code above uses this helper class:

public class TextPositionSequence implements CharSequence
{
    public TextPositionSequence(List<TextPosition> textPositions)
    {
        this(textPositions, 0, textPositions.size());
    }

    public TextPositionSequence(List<TextPosition> textPositions, int start, int end)
    {
        this.textPositions = textPositions;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length()
    {
        return end - start;
    }

    @Override
    public char charAt(int index)
    {
        TextPosition textPosition = textPositionAt(index);
        String text = textPosition.getUnicode();
        return text.charAt(0);
    }

    @Override
    public TextPositionSequence subSequence(int start, int end)
    {
        return new TextPositionSequence(textPositions, this.start + start, this.start + end);
    }

    @Override
    public String toString()
    {
        StringBuilder builder = new StringBuilder(length());
        for (int i = 0; i < length(); i++)
        {
            builder.append(charAt(i));
        }
        return builder.toString();
    }

    public TextPosition textPositionAt(int index)
    {
        return textPositions.get(start + index);
    }

    public float getX()
    {
        return textPositions.get(start).getXDirAdj();
    }

    public float getY()
    {
        return textPositions.get(start).getYDirAdj();
    }

    public float getWidth()
    {
        if (end == start)
            return 0;
        TextPosition first = textPositions.get(start);
        TextPosition last = textPositions.get(end - 1);
        return last.getWidthDirAdj() + last.getXDirAdj() - first.getXDirAdj();
    }

    final List<TextPosition> textPositions;
    final int start, end;
}
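To see what subSequence and getWidth compute without loading a PDF, here is a reduced sketch that replaces TextPosition with a hypothetical Glyph class holding only a character, a left edge, and a width. Glyph, GlyphRun, and all numbers are invented for this illustration. As in getWidth above, the width of a hit is the right edge of its last glyph minus the left edge of its first glyph.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for TextPosition: one character with a left edge and a width.
final class Glyph {
    final char ch;
    final float x;
    final float width;

    Glyph(char ch, float x, float width) {
        this.ch = ch;
        this.x = x;
        this.width = width;
    }
}

// Reduced analogue of TextPositionSequence: a view onto glyphs[start, end).
final class GlyphRun {
    private final List<Glyph> glyphs;
    private final int start;
    private final int end;

    GlyphRun(List<Glyph> glyphs, int start, int end) {
        this.glyphs = glyphs;
        this.start = start;
        this.end = end;
    }

    // Lays out s from x = 100 with every glyph 5 units wide (invented numbers).
    static GlyphRun ofFixedWidth(String s) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            glyphs.add(new Glyph(s.charAt(i), 100f + 5f * i, 5f));
        }
        return new GlyphRun(glyphs, 0, glyphs.size());
    }

    // Indices are relative to this run, so the offset 'start' is added to both.
    GlyphRun subRun(int from, int to) {
        return new GlyphRun(glyphs, start + from, start + to);
    }

    String text() {
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            sb.append(glyphs.get(i).ch);
        }
        return sb.toString();
    }

    float x() {
        return glyphs.get(start).x;
    }

    // Right edge of the last glyph minus left edge of the first glyph.
    float width() {
        if (end == start) {
            return 0;
        }
        Glyph first = glyphs.get(start);
        Glyph last = glyphs.get(end - 1);
        return last.x + last.width - first.x;
    }
}
```

For the run built from "ab${x}cd", the term "${x}" starts at index 2, so the hit begins at x = 110 and is 20 units wide under these made-up metrics.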

To merely print their positions, widths, final letters, and the positions of those final letters, you can use this:

void printSubwords(PDDocument document, String searchTerm) throws IOException
{
    System.out.printf("* Looking for '%s'\n", searchTerm);
    for (int page = 1; page <= document.getNumberOfPages(); page++)
    {
        List<TextPositionSequence> hits = findSubwords(document, page, searchTerm);
        for (TextPositionSequence hit : hits)
        {
            TextPosition lastPosition = hit.textPositionAt(hit.length() - 1);
            System.out.printf("  Page %s at %s, %s with width %s and last letter '%s' at %s, %s\n",
                    page, hit.getX(), hit.getY(), hit.getWidth(),
                    lastPosition.getUnicode(), lastPosition.getXDirAdj(), lastPosition.getYDirAdj());
        }
    }
}

For testing, I created a small test file with MS Word:

Sample file with variables

The output of this test

@Test
public void testVariables() throws IOException
{
    try (   InputStream resource = getClass().getResourceAsStream("Variables.pdf");
            PDDocument document = PDDocument.load(resource);    )
    {
        System.out.println("\nVariables.pdf\n-------------\n");
        printSubwords(document, "${var1}");
        printSubwords(document, "${var 2}");
    }
}

Variables.pdf
-------------

* Looking for '${var1}'
  Page 1 at 164.39648, 158.06 with width 34.67856 and last letter '}' at 193.22, 158.06
  Page 1 at 188.75699, 174.13995 with width 34.58806 and last letter '}' at 217.49, 174.13995
  Page 1 at 167.49583, 190.21997 with width 38.000168 and last letter '}' at 196.22, 190.21997
  Page 1 at 176.67009, 206.18 with width 35.667114 and last letter '}' at 205.49, 206.18

* Looking for '${var 2}'
  Page 1 at 164.39648, 257.65997 with width 37.078552 and last letter '}' at 195.62, 257.65997
  Page 1 at 188.75699, 273.74 with width 37.108047 and last letter '}' at 220.01, 273.74
  Page 1 at 167.49583, 289.72998 with width 40.55017 and last letter '}' at 198.74, 289.72998
  Page 1 at 176.67778, 305.81 with width 38.059418 and last letter '}' at 207.89, 305.81

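I was a bit surprised that ${var 2} was found when it sits on a single line; after all, the PDFBox code led me to assume that the writeString method I overrode only receives words. It looks as if it receives longer parts of the line than mere words...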

If you need other data from the grouped TextPosition instances, simply enhance TextPositionSequence accordingly.


0 votes

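As mentioned, this is not an answer to your question, but below is a skeleton example of how you could do this in iText. That is not to say the same cannot be done in PDFBox.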

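Basically, you create a RenderListener that accepts "parse events" as they occur. You pass this listener to PdfReaderContentParser.processContent. In the listener's renderText method you get all the information needed to reconstruct the layout, including the x/y coordinates and the text/image/... that makes up the content.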

RenderListener listener = new RenderListener() {
    @Override
    public void renderText(TextRenderInfo arg0) {
        LineSegment segment = arg0.getBaseline();
        int x = (int) segment.getStartPoint().get(Vector.I1);
        // smaller Y means closer to the BOTTOM of the page. So we negate the Y to get proper top-to-bottom ordering
        int y = -(int) segment.getStartPoint().get(Vector.I2);
        int endx = (int) segment.getEndPoint().get(Vector.I1);
        log.debug("renderText " + x + ".." + endx + "/" + y + ": " + arg0.getText());
        ...
    }
    ... // other overrides
};

PdfReaderContentParser p = new PdfReaderContentParser(reader);
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    log.info("handling page " + i);
    p.processContent(i, listener);
}
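The comment in renderText notes that a smaller y lies closer to the bottom of the page, which is why y is negated to get top-to-bottom ordering. Below is a minimal sketch, in plain Java, of how chunks collected this way could then be put into reading order. The Chunk and ReadingOrder classes and the sample coordinates are made up here and are not part of iText.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical container for one piece of text reported by a listener.
final class Chunk {
    final int x;
    final int y;       // negated PDF y, so smaller means higher on the page
    final String text;

    Chunk(int x, int pdfY, String text) {
        this.x = x;
        this.y = -pdfY; // same negation as in renderText
        this.text = text;
    }
}

final class ReadingOrder {

    // Sorts top to bottom first, then left to right within the same y.
    static List<String> sort(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y)
                .thenComparingInt(c -> c.x));
        List<String> texts = new ArrayList<>();
        for (Chunk c : sorted) {
            texts.add(c.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Chunk> chunks = new ArrayList<>();
        chunks.add(new Chunk(50, 680, "next"));
        chunks.add(new Chunk(120, 700, "world"));
        chunks.add(new Chunk(50, 700, "Hello"));
        System.out.println(sort(chunks));
    }
}
```

This treats chunks as belonging to the same line only when their integer y values are equal. Real documents may need a tolerance for slightly shifted baselines.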
    

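0 votes

I wanted to highlight different words in a PDF file. To do that I need to know the word coordinates precisely, so I take the (x, y) coordinates of the top-left corner of the first letter and the (x, y) coordinates of the top-right corner of the last letter.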

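Afterwards, save these points in an array. Keep in mind that, to get the y coordinate right, you need the position relative to the page height: the value returned by getYDirAdj() is measured from the top of the page and often does not match the coordinate needed on the page, so it is subtracted from the page height below.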

protected void writeString(String string, List<TextPosition> textPositions) throws IOException {
    boolean isFound = false;
    float posXInit = 0, posXEnd = 0, posYInit = 0, posYEnd = 0, width = 0, height = 0;
    String[] criteria = {"Word1", "Word2", "Word3", ....};

    for (int i = 0; i < criteria.length; i++) {
        if (string.contains(criteria[i])) {
            isFound = true;
        }
    }

    if (isFound) {
        TextPosition first = textPositions.get(0);
        TextPosition last = textPositions.get(textPositions.size() - 1);

        posXInit = first.getXDirAdj();
        posXEnd = last.getXDirAdj() + last.getWidth();
        // Convert the top-down y values into bottom-up page coordinates
        posYInit = first.getPageHeight() - first.getYDirAdj();
        posYEnd = first.getPageHeight() - last.getYDirAdj();
        width = first.getWidthDirAdj();
        height = first.getHeightDir();

        // The original printed a fontHeight variable that was never assigned; height is printed instead
        System.out.println(string + "X-Init = " + posXInit + "; Y-Init = " + posYInit
                + "; X-End = " + posXEnd + "; Y-End = " + posYEnd + "; Font-Height = " + height);

        // quadPoints is array of x,y coordinates in Z-like order (top-left, top-right, bottom-left,bottom-right)
        // of the area to be highlighted
        float quadPoints[] = {posXInit, posYEnd + height + 2, posXEnd, posYEnd + height + 2,
                posXInit, posYInit - 2, posXEnd, posYEnd - 2};

        // document is the PDDocument being processed; it must be accessible here, e.g. as a field
        List<PDAnnotation> annotations = document.getPage(this.getCurrentPageNo() - 1).getAnnotations();
        PDAnnotationTextMarkup highlight = new PDAnnotationTextMarkup(PDAnnotationTextMarkup.SUB_TYPE_HIGHLIGHT);

        PDRectangle position = new PDRectangle();
        position.setLowerLeftX(posXInit);
        position.setLowerLeftY(posYEnd);
        position.setUpperRightX(posXEnd);
        position.setUpperRightY(posYEnd + height);
        highlight.setRectangle(position);
        highlight.setQuadPoints(quadPoints);

        PDColor yellow = new PDColor(new float[]{1, 1, 1 / 255F}, PDDeviceRGB.INSTANCE);
        highlight.setColor(yellow);
        annotations.add(highlight);
    }
}
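The coordinate conversion in the code above can be isolated into a few lines. The sketch below assumes an unrotated page and a hit on a single line. It takes the top-down y value as reported by getYDirAdj() and produces quad points in the Z-like order the answer describes (top-left, top-right, bottom-left, bottom-right). The class HighlightBox, its parameters, and the sample numbers are hypothetical, and the code does not call PDFBox.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of PDFBox.
final class HighlightBox {

    // xStart/xEnd: left edge of the first and right edge of the last character.
    // yDirAdj:     y of the text measured from the top of the page.
    // height:      text height; padding: extra margin above and below.
    static float[] quadPoints(float xStart, float xEnd, float yDirAdj,
                              float height, float pageHeight, float padding) {
        float yFromBottom = pageHeight - yDirAdj; // flip to bottom-up coordinates
        float top = yFromBottom + height + padding;
        float bottom = yFromBottom - padding;
        // Z-like order: top-left, top-right, bottom-left, bottom-right
        return new float[] {xStart, top, xEnd, top, xStart, bottom, xEnd, bottom};
    }

    public static void main(String[] args) {
        // Sample values: text at x 100..150, 92 units below the top of a 792-unit page.
        float[] quad = quadPoints(100f, 150f, 92f, 10f, 792f, 2f);
        System.out.println(Arrays.toString(quad));
    }
}
```

The padding of 2 units mirrors the "+ 2" and "- 2" offsets used in the answer's quadPoints array.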
    

0 votes

You can try this:

@Override
protected void writeString(String str, List<TextPosition> textPositions) throws IOException {
    TextPosition startPos = textPositions.get(0);
    TextPosition endPos = textPositions.get(textPositions.size() - 1);
    System.out.println(str + " [(" + startPos.getXDirAdj() + "," + startPos.getYDirAdj() + ") ,("
            + endPos.getXDirAdj() + "," + endPos.getYDirAdj() + ")]");
}
The output will look like "String [(54.0,746.08) ,(99.71,746.08)]".
