Detect whether a PDF was created from a scanned document using OCR [pdfbox]

Question

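I would like to find out whether a PDF was created from a scanned document using OCR.

To make the text in a scanned document selectable, I would expect the same text to be written with a transparent color, a special font, ...

I am using pdfbox and have looked at the fonts, colors and many other properties, but found nothing special.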

Answer 1

In my case, the text rendering mode was set to "neither fill nor stroke text".

pdfbox code:

getGraphicsState().getTextState().getRenderingMode() == PDTextState.RENDERING_MODE_NEITHER_FILL_NOR_STROKE_TEXT
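
For context, this rendering mode corresponds to the `Tr` operator in a page's content stream, where mode 3 means the glyphs are neither filled nor stroked, i.e. invisible. The following is a minimal, standard-library-only sketch that looks for that operator in content-stream text that has already been decoded. The class and method names are hypothetical and not part of pdfbox, and a regular expression can be fooled by string literals or inline image data, so a real parser such as pdfbox is the better tool in practice.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of pdfbox. */
final class InvisibleTextScan {
    // The operand 3 followed by the Tr (set text rendering mode) operator.
    // The lookbehind rejects operands such as "13" or "0.3".
    private static final Pattern INVISIBLE_TR =
            Pattern.compile("(?<![\\w.+-])3\\s+Tr\\b");

    /** Returns true if the decoded content stream sets rendering mode 3 (invisible). */
    static boolean setsInvisibleMode(String decodedContentStream) {
        return INVISIBLE_TR.matcher(decodedContentStream).find();
    }

    public static void main(String[] args) {
        String ocrLayer = "BT 3 Tr /F1 12 Tf 72 700 Td (hello) Tj ET";
        String normalText = "BT 0 Tr /F1 12 Tf 72 700 Td (hello) Tj ET";
        System.out.println(setsInvisibleMode(ocrLayer));   // stream with an invisible text layer
        System.out.println(setsInvisibleMode(normalText)); // stream with ordinary filled text
    }
}
```

Note that a single invisible text run does not prove the whole document was OCRd; counting how much of the text is invisible would give a stronger signal.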

Answer 2

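In most cases the original image is still present and the OCRd text sits invisibly beneath it.

So one possibility is to find out whether there is an image covering the entire area that contains text.

Another possibility is to look at the fonts and make an informed decision based on them.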
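
The first idea reduces to a geometry check once the image's bounding box on the page is known. Below is a minimal sketch under that assumption. The class name, method names, and the 0.95 threshold are illustrative choices rather than anything provided by pdfbox, and extracting the bounding box from the page is left out.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: how much of a page does one image cover? */
final class PageCoverage {
    /** Fraction of the page area (0.0 to 1.0) overlapped by the image box. */
    static double coveredFraction(Rectangle2D page, Rectangle2D image) {
        if (page.isEmpty()) {
            return 0.0;
        }
        Rectangle2D overlap = page.createIntersection(image);
        if (overlap.isEmpty()) {
            return 0.0; // the boxes do not overlap
        }
        return (overlap.getWidth() * overlap.getHeight())
                / (page.getWidth() * page.getHeight());
    }

    /** Assumed heuristic: an image that fills almost the whole page suggests a scan. */
    static boolean looksLikeScan(Rectangle2D page, Rectangle2D image) {
        return coveredFraction(page, image) >= 0.95;
    }

    public static void main(String[] args) {
        Rectangle2D page = new Rectangle2D.Double(0, 0, 595, 842); // roughly A4 in PDF points
        System.out.println(looksLikeScan(page, new Rectangle2D.Double(0, 0, 595, 842)));
        System.out.println(looksLikeScan(page, new Rectangle2D.Double(50, 50, 200, 100)));
    }
}
```

A digitally created page with a full-page background image would also pass this check, so it is best combined with another signal such as invisible text.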

Answer 3

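I created a script to detect whether a PDF was OCRd. The main idea: in OCRd PDFs, the text is invisible.

Algorithm to test whether a given PDF (f1) was OCRd:

1. Create a copy of f1, denoted f2.
2. Delete all text in f2.
3. Render all (or just a few) pages of f1 and f2 to images (PNG).
4. f1 was OCRd if all images of f1 and f2 are identical.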
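
The final comparison step can be sketched on its own. The example below counts differing pixels between two already rendered page images, which is the same kind of absolute-error count that `compare -metric AE` reports. Rendering the pages and stripping the text are assumed to have happened elsewhere, and the class and method names are hypothetical.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper for the comparison step: compare rendered pages. */
final class PageImageDiff {
    /** Number of pixels whose ARGB value differs, or -1 if the sizes differ. */
    static long differingPixels(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return -1;
        }
        long diff = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    diff++;
                }
            }
        }
        return diff;
    }

    /** True only if every page renders identically with and without its text. */
    static boolean allPagesIdentical(BufferedImage[] withText, BufferedImage[] withoutText) {
        if (withText.length != withoutText.length) {
            return false;
        }
        for (int i = 0; i < withText.length; i++) {
            if (differingPixels(withText[i], withoutText[i]) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BufferedImage page = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        BufferedImage same = new BufferedImage(2, 2, BufferedImage.TYPE_INT_RGB);
        System.out.println(differingPixels(page, same)); // both images are blank
    }
}
```

In a real run the PNG files could be loaded with `javax.imageio.ImageIO.read` before being passed to these methods.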

https://github.com/jfilter/pdf-scripts/blob/master/is_ocrd_pdf.sh

#!/usr/bin/env bash
set -e
set -x

################################################################################
# Check if a PDF was scanned or created digitally, works on OCRd PDFs
#
# Usage:
#   bash is_scanned_pdf.sh [-p N] file
#
#   Exit 0: Yes, file is a scanned PDF
#   Exit 99: No, file was created digitally
#
# Arguments:
#   -p or --pages: pos. integer, only consider first N pages
#
# Please report issues at https://github.com/jfilter/pdf-scripts/issues
#
# GPLv3, Copyright (c) 2020 Johannes Filter
################################################################################

# parse arguments
# h/t https://stackoverflow.com/a/33826763/4028896
max_pages=-1
# skip over positional argument of the file(s), thus -gt 1
while [[ "$#" -gt 1 ]]; do
  case $1 in
  -p | --pages)
    max_pages="$2"
    shift
    ;;
  *)
    echo "Unknown parameter passed: $1"
    exit 1
    ;;
  esac
  shift
done

command_exists() {
  # abort if the given command is not available on PATH
  if ! command -v "$1" &>/dev/null; then
    echo "error: $1 is not installed." >&2
    exit 1
  fi
}

command_exists mutool && command_exists gs && command_exists compare
command_exists pdfinfo

orig=$PWD
num_pages=$(pdfinfo "$1" | awk '/^Pages:/ {print $2}')

echo $num_pages

echo $max_pages

if ((max_pages > 0 && max_pages < num_pages)); then
  num_pages=$max_pages
fi

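# resolve the input to an absolute path before changing into the temp directory
input="$(cd "$(dirname "$1")" && pwd)/$(basename "$1")"
cd "$(mktemp -d)"

for ((i = 1; i <= num_pages; i++)); do
  mkdir -p "output/$i" && echo "$i"
done

# important to filter text on output of GS (tmp1), cuz GS alters input PDF...
gs -o tmp1.pdf -sDEVICE=pdfwrite -dLastPage="$num_pages" "$input" &>/dev/null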
gs -o tmp2.pdf -sDEVICE=pdfwrite -dFILTERTEXT tmp1.pdf &>/dev/null
mutool convert -o output/%d/1.png tmp1.pdf 2>/dev/null
mutool convert -o output/%d/2.png tmp2.pdf 2>/dev/null

for ((i = 1; i <= num_pages; i++)); do
  echo $i
  # number of differing pixels; 0 means the two images are identical
  # discard the diff image (null:)
  if ! compare -metric AE output/$i/1.png output/$i/2.png null: 2>&1; then
    echo " pixels difference, not a scanned PDF, mismatch on page $i"
    exit 99
  fi
done
