iText7 reads lines in the wrong order (2)

Question

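I am using iText 7.2.2 and trying to extract text from some PDFs that come from a scanning process.

I have a PDF that was processed with ocrmypdf to add an "OCR-ed" text layer.

The image below shows the original (before OCR):

When I extract the text, I get line breaks, many spaces (trimmed here), and the words in the wrong order. Code: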

"NAME  : [$($name)]"

Output:

NAME  : [NARANJAS
HERNANDEZ
C.V.
S.A. DE]

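The order is wrong, so I cannot simply join the lines.

I found this post, which sounded promising, but my situation turned out to be a bit different. After implementing its suggestion, the result is: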

NAME  : [NARANJAS HERNANDEZ C.V.
S.A. DE]

The following code is from that post:

  public virtual bool SameLine(ITextChunkLocation @as)
  {
//      Console.WriteLine("OrientationMagnitude: " + OrientationMagnitude() + ", as.OrientationMagnitude: " + @as.OrientationMagnitude());
      if (OrientationMagnitude() != @as.OrientationMagnitude())
      {
          return false;
      }
      int distPerpendicularDiff = DistPerpendicular() - @as.DistPerpendicular();
      if (Math.Abs(distPerpendicularDiff) < 5)
      {
          return true;
      }
      LineSegment mySegment = new LineSegment(startLocation, endLocation);
      LineSegment otherSegment = new LineSegment(@as.GetStartLocation(), @as.GetEndLocation());
      return Math.Abs(distPerpendicularDiff) <= DIACRITICAL_MARKS_ALLOWED_VERTICAL_DEVIATION && (mySegment.GetLength() == 0 || otherSegment.GetLength() == 0);
  }

The commented-out line produces the following (once it is uncommented, of course :) ):

OrientationMagnitude: 6, as.OrientationMagnitude: 0
OrientationMagnitude: 6, as.OrientationMagnitude: 6
OrientationMagnitude: 6, as.OrientationMagnitude: 6
OrientationMagnitude: 7, as.OrientationMagnitude: 6
OrientationMagnitude: 7, as.OrientationMagnitude: 7
NOMBRE  : [NARANJAS HERNANDEZ C.V.
S.A. DE]

This is as far as I could get.

The RUPS executable shows:

Any suggestions on how to solve this would be greatly appreciated.

Answer 1

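As you can see in your console output, the orientation values of the text chunks differ. That makes SameLine return false, so the extraction puts text with different orientations on separate lines.

I would therefore try to make the orientation comparison more lenient, just as the distPerpendicular comparison has already been made more lenient.

For example, in SameLine replace

if (OrientationMagnitude() != @as.OrientationMagnitude())

with

int orientationMagnitudeDiff = OrientationMagnitude() - @as.OrientationMagnitude();
if (Math.Abs(orientationMagnitudeDiff) > 3)

(You may want to experiment and adjust the value "3" a little.)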
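To see the effect of the lenient check in isolation, here is a minimal sketch that does not depend on iText. `ChunkPosition` and `LenientLineTest` are hypothetical names invented for this illustration, and the perpendicular distances (100, 102) are made-up values; only the orientation values 6 and 7 and the thresholds 3 and 5 come from the thread. The sketch also leaves out the diacritical-marks fallback that the real SameLine has.

```csharp
using System;

// Hypothetical stand-in for the two values SameLine compares; not an iText type.
public sealed class ChunkPosition
{
    public int OrientationMagnitude { get; }
    public int DistPerpendicular { get; }

    public ChunkPosition(int orientationMagnitude, int distPerpendicular)
    {
        OrientationMagnitude = orientationMagnitude;
        DistPerpendicular = distPerpendicular;
    }
}

public static class LenientLineTest
{
    // 3 is the value suggested in this answer; 5 is the value used in the question's SameLine.
    public const int OrientationTolerance = 3;
    public const int PerpendicularTolerance = 5;

    public static bool SameLine(ChunkPosition a, ChunkPosition b)
    {
        // Lenient orientation check: small differences no longer split a line.
        if (Math.Abs(a.OrientationMagnitude - b.OrientationMagnitude) > OrientationTolerance)
        {
            return false;
        }
        // Lenient perpendicular check, as in the question's code.
        return Math.Abs(a.DistPerpendicular - b.DistPerpendicular) < PerpendicularTolerance;
    }

    public static void Main()
    {
        var first = new ChunkPosition(6, 100);     // orientation 6, as in the console output
        var second = new ChunkPosition(7, 102);    // orientation 7, slightly skewed by the scan
        var rotated = new ChunkPosition(1570, 100); // illustrative value for clearly rotated text

        Console.WriteLine(SameLine(first, second));  // True
        Console.WriteLine(SameLine(first, rotated)); // False
    }
}
```

With the strict `!=` comparison the first pair would be split into two lines. The tolerance keeps slightly skewed chunks together while clearly rotated text is still separated.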

Answer 2

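I fixed the same-line problem with the same post, but the order was still wrong. I decoded the PDF with qpdf using the command below so that I could work on individual lines, then edited the result in Notepad++.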

qpdf --qdf --object-streams=disable --decode-level=all in.pdf decoded.pdf

I found that changing the height value of the affected lines by 1 (the last operand of Tm) fixes it, from

1 0 0 -1 19 234 Tm

to

1 0 0 -1 19 235 Tm

I added a bit of flex to the CompareTo function of LocationTextExtractionStrategy, which solved the problem for me:

    public int CompareTo(ITextChunkLocation other)
    {
        if (this == other)
        {
            return 0;
        }

        // Order by orientation first.
        int num = CompareInts(orientationMagnitude, other.OrientationMagnitude);
        if (num != 0)
        {
            return num;
        }

        // The added flex: perpendicular distances that differ by at most 2
        // are treated as belonging to the same line.
        var diff = Math.Abs(distPerpendicular - other.DistPerpendicular);
        if (diff > 2)
        {
            return CompareInts(distPerpendicular, other.DistPerpendicular);
        }

        // Same line: order by the start position along the line.
        return distParallelStart < other.DistParallelStart ? -1 : 1;
    }

This is with iText 7.1.15 and iTextSharp 5.5.13.2.
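The following self-contained sketch shows how that tolerance changes the resulting word order. It does not use iText: `Chunk` and `TolerantOrderDemo` are hypothetical names, and the coordinates are invented so that two words sit one unit off the rest of the line, similar to the 234 versus 235 difference described above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk; not an iText type.
public sealed class Chunk
{
    public string Text { get; }
    public int DistPerpendicular { get; }
    public int DistParallelStart { get; }

    public Chunk(string text, int distPerpendicular, int distParallelStart)
    {
        Text = text;
        DistPerpendicular = distPerpendicular;
        DistParallelStart = distParallelStart;
    }
}

public static class TolerantOrderDemo
{
    // Sorts by perpendicular distance, then by start position along the line.
    // Perpendicular distances differing by at most `tolerance` count as one line.
    public static string Join(IEnumerable<Chunk> chunks, int tolerance)
    {
        var sorted = chunks.ToList();
        sorted.Sort((a, b) =>
        {
            if (Math.Abs(a.DistPerpendicular - b.DistPerpendicular) > tolerance)
            {
                return a.DistPerpendicular.CompareTo(b.DistPerpendicular);
            }
            return a.DistParallelStart.CompareTo(b.DistParallelStart);
        });
        return string.Join(" ", sorted.Select(c => c.Text));
    }

    public static void Main()
    {
        // Invented coordinates: "S.A." and "DE" are one unit off the other words.
        var chunks = new[]
        {
            new Chunk("NARANJAS", 235, 10),
            new Chunk("HERNANDEZ", 235, 80),
            new Chunk("S.A.", 236, 160),
            new Chunk("DE", 236, 190),
            new Chunk("C.V.", 235, 210),
        };

        Console.WriteLine(Join(chunks, 0)); // NARANJAS HERNANDEZ C.V. S.A. DE
        Console.WriteLine(Join(chunks, 2)); // NARANJAS HERNANDEZ S.A. DE C.V.
    }
}
```

One caveat: a tolerance-based comparison is not transitive in general (234 matches 236, and 236 matches 238, but 234 does not match 238). On pages with many closely spaced lines the sort order may therefore become inconsistent, so the tolerance is best kept well below the real line spacing.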

Answer 3

Try Docotic.Pdf. I had the same problem until I switched from iText7 to Docotic.Pdf. It works great!
