Apache PdfBox: Confusion about coordinates

Question

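I am trying to extract some text from a PDF. To do that, I need to define a rectangle that contains the text.

When I compare the coordinates used for text extraction with the coordinates used for drawing graphics, they appear to have different meanings.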

package MyTest.MyTest;

import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.PDPageContentStream.*;
import org.apache.pdfbox.text.*;
import java.awt.*;
import java.io.*;

public class MyTest 
{   
  public static void main (String [] args) throws Exception
  { 
    PDDocument pd = PDDocument.load (new File ("my.pdf"));  
    PDFTextStripperByArea st = new PDFTextStripperByArea ();
    PDPage pg = pd.getPage (0);

    float h = pg.getMediaBox ().getHeight ();
    float w = pg.getMediaBox ().getWidth ();
    System.out.println (h + " x " + w + " in internal units");
    h = h / 72 * 2.54f * 10;
    w = w / 72 * 2.54f * 10;
    System.out.println (h + " x " + w + " in mm");



    int X = 85;
    int Y = 175;
    int dX = 250;
    int dY = 15;

    // extract some text
    st.addRegion ("a", new Rectangle (X, Y, dX, dY));
    st.extractRegions (pg);
    String text = st.getTextForRegion ("a");
    System.out.println("text="+text);


    // fill a rectangle
    PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);
    contents.setNonStrokingColor (Color.RED);  
    contents.addRect (X, Y, dX, dY);
    contents.fill ();
    contents.close ();
    pd.save ("x.pdf");
  }
}

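The text I extract (the text= output on the console) is not the text I cover with the red rectangle (in the generated x.pdf).

Why?

To test this, try it with any PDF you have at hand. To avoid a lot of trial and error in finding a rectangle that actually contains text, use a file with plenty of text.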

Answer 1

There are (at least) two issues in your approach:

Different coordinate systems

You use st.addRegion. Its JavaDoc comment tells us:

/**
 * Add a new region to group text by.
 *
 * @param regionName The name of the region.
 * @param rect The rectangle area to retrieve the text from. The y-coordinates are java
 * coordinates (y == 0 is top), not PDF coordinates (y == 0 is bottom).
 */
public void addRegion( String regionName, Rectangle2D rect )

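(Actually, the whole text extraction apparatus of PDFBox uses its own coordinate system, and there have already been many questions on Stack Overflow because of the trouble this causes.)

contents.addRect, on the other hand, does not use those "java coordinates". Thus, you have to subtract the y coordinate you use in text extraction from the maximum (upper) y coordinate of the crop box to get a coordinate for addRect.

Furthermore, the anchor of the region rectangle is its upper left corner, while the anchor of a regular PDF rectangle (like the one defined with contents.addRect) is its lower left corner. Thus, you also have to subtract the rectangle height from that y coordinate.

Actually, you may have to change the x coordinate, too. It is not mirrored, but it may be shifted: the PDFBox text extraction coordinate system uses x = 0 for the left page border, which is not necessarily the case in PDF user space. Thus, you may have to add the x coordinate of the left border of the crop box to the text extraction x coordinate.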
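The conversion described above can be sketched as a small helper. This is only an illustration under stated assumptions: the class and method names are made up, the page is assumed to be unrotated, and the two crop box values are passed in as plain floats (in PDFBox they would presumably come from pg.getCropBox().getLowerLeftX() and pg.getCropBox().getUpperRightY()). The A4 page height of 842 used in main is likewise an assumption, not something taken from the question.

```java
import java.awt.geom.Rectangle2D;

final class RegionToUserSpace {
    /**
     * Converts a rectangle given in text-extraction coordinates (origin at the
     * top-left of the crop box, y growing downwards, anchored at its upper-left
     * corner) into PDF user-space coordinates (y growing upwards, anchored at
     * its lower-left corner). Assumes an unrotated page.
     */
    static Rectangle2D.Float toUserSpace(Rectangle2D region,
                                         float cropLowerLeftX,
                                         float cropUpperRightY) {
        // x is only shifted by the left border of the crop box.
        float x = (float) region.getX() + cropLowerLeftX;
        // y is mirrored against the top of the crop box, then moved down by
        // the rectangle height to reach the lower-left anchor.
        float y = cropUpperRightY - (float) region.getY() - (float) region.getHeight();
        return new Rectangle2D.Float(x, y,
                (float) region.getWidth(), (float) region.getHeight());
    }

    public static void main(String[] args) {
        // Region from the question: X = 85, Y = 175, dX = 250, dY = 15.
        Rectangle2D region = new Rectangle2D.Float(85, 175, 250, 15);
        // Assumed crop box: lower-left x = 0, upper y = 842 (A4 in points).
        Rectangle2D.Float r = toUserSpace(region, 0f, 842f);
        System.out.println(r.x + ", " + r.y + ", " + r.width + ", " + r.height);
    }
}
```

With these assumed crop box values, the region from the question maps to a lower-left y of 842 - 175 - 15 = 652, so the matching call would be contents.addRect (85, 652, 250, 15).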

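Possibly changed coordinate system

In the page content stream, the coordinate system may have been changed by applying a transformation to the current transformation matrix. Thus, the coordinates in the instructions you append may have a completely different meaning than described above.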
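To illustrate why a leftover transformation matters, here is a minimal sketch using java.awt.geom.AffineTransform as a stand-in for the current transformation matrix. The matrix values and the point (85, 652) are hypothetical and chosen only for the demonstration; the class name is made up as well.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

final class CtmEffect {
    /**
     * Maps a point as written in appended content-stream instructions to
     * default user space, given the transformation matrix in effect.
     */
    static Point2D map(AffineTransform ctm, double x, double y) {
        return ctm.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // Graphics state was reset: the matrix is the identity.
        AffineTransform identity = new AffineTransform();
        // Hypothetical matrix left behind by existing page content:
        // scale by 0.5, then shift by (100, 50).
        AffineTransform leftover = new AffineTransform(0.5, 0, 0, 0.5, 100, 50);

        // The same coordinates end up in different places on the page.
        System.out.println(map(identity, 85, 652));
        System.out.println(map(leftover, 85, 652));
    }
}
```

Under the identity matrix the point stays at (85, 652), while under the assumed leftover matrix it lands at (142.5, 376), so a rectangle drawn with otherwise correct coordinates would appear in the wrong place and at the wrong size.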

To rule out such an effect, you should use a different PDPageContentStream constructor, one with an additional boolean resetContext parameter:

/**
 * Create a new PDPage content stream.
 *
 * @param document The document the page is part of.
 * @param sourcePage The page to write the contents to.
 * @param appendContent Indicates whether content will be overwritten, appended or prepended.
 * @param compress Tell if the content stream should compress the page contents.
 * @param resetContext Tell if the graphic context should be reset. This is only relevant when
 * the appendContent parameter is set to {@link AppendMode#APPEND}. You should use this when
 * appending to an existing stream, because the existing stream may have changed graphic
 * properties (e.g. scaling, rotation).
 * @throws IOException If there is an error writing to the page contents.
 */
public PDPageContentStream(PDDocument document, PDPage sourcePage, AppendMode appendContent,
                           boolean compress, boolean resetContext) throws IOException

I.e. replace

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false);

with

PDPageContentStream contents = new PDPageContentStream (pd, pg,AppendMode.APPEND, false, true);
