Spaces in PDF font names cause printer errors


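Some background: I maintain a largely unindexed archive of scientific literature, where scanning paper documents followed by OCR is used to produce searchable text. This worked well until the university switched to printers without OCR capability. I then had to fall back on separate scanning and OCR, and for this I chose Adobe Acrobat Pro. It seemed to work fine until one day I realised I could not print some of the documents I had been working on (printing from Mac Preview, not from Adobe Acrobat). The error message from the printer (Ricoh IM C4500) was: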

  ERROR: undefined
  OFFENDING COMMAND: New
  STACK:
  /AAAAAC+*Times
  /FontName

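My understanding of PDF is limited, but by first printing to PS (still from Preview) and then regenerating the PDF with Adobe Distiller, I was able to reproduce the error. That led me to replace every font name "with spaces" by a font name "with dashes" (in the PS), like this: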

  Times New Roman -> Times-New-Roman

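This made Adobe Distiller happy, and the regenerated PDF printed without any problems. I then tried to do the same with pdf2ps and ps2pdf. Interestingly, these two programs together fixed the problem without the manual intervention described above.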
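The round trip through PostScript can be scripted. What follows is a minimal sketch, assuming Ghostscript's pdf2ps and ps2pdf wrappers are on the PATH; the function name roundtrip_pdf and the temporary-file handling are made up for illustration and are not part of the workflow described above.

```bash
# Hypothetical helper: rebuild a PDF by converting it to PostScript and back.
# Usage: roundtrip_pdf input.pdf output.pdf
roundtrip_pdf() {
  local in=$1 out=$2 tmp status
  # Refuse to run on a missing input instead of leaving a stray temp file.
  [ -f "$in" ] || { echo "no such file: $in" >&2; return 1; }
  tmp=$(mktemp "${TMPDIR:-/tmp}/roundtrip.XXXXXX") || return 1
  # pdf2ps writes PostScript, ps2pdf turns it back into PDF.
  pdf2ps "$in" "$tmp" && ps2pdf "$tmp" "$out"
  status=$?
  rm -f "$tmp"
  return $status
}
```

Since the result is a newly generated PDF, it is probably worth checking on a few files that the searchable OCR text is still present afterwards.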

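At this point I should show you an MWE, but I do not know how to do that with a PDF in the picture. Besides, I think the cause of the problem is fairly clear. The two questions are: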

  1. How do I fix the files that are already affected?
  2. How do I avoid the font name problem in the future?

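There are many files in the archive, and I see no viable solution without using the command line, which is fine with me. For example, I can get a list of the fonts used in a PDF by running pdffonts (as described in "How to find out which fonts are referenced and which are embedded in a PDF document"). But if a PDF has font names with spaces, how do I go on to "edit" it? I assume the file needs to be rebuilt somehow, but here I really need some advice. To me, Ghostscript, for example, looks like the ideal tool for cleaning up PDF files at this level, but that may be naive.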
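As a rough illustration of the detection step, the sketch below filters pdffonts-style output down to the font names that contain a space. It assumes the layout of two header lines followed by one row per font, with the name column followed by a type column such as "TrueType", "Type 1" or "CID Type 0C"; the function name fonts_with_spaces is hypothetical.

```bash
# Hypothetical filter: read pdffonts output on stdin and print the unique
# font names that contain at least one space.
fonts_with_spaces() {
  sed -n '3,$p' |                                               # skip the two header lines
    sed -E 's/[ ]+(CID[ ]+)?(TrueType|Type [0-9]).*$//' |      # cut at the type column
    grep ' ' |                                                  # keep names with a space
    sort -u
}
```

It would be used as `pdffonts file.pdf | fonts_with_spaces`; an empty result suggests the file needs no patching. A font whose own name happens to contain "Type 1" or similar would be cut too early, so this is a heuristic only.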

pdf printing ocr acrobat
1 answer

I hope it is OK to answer my own question. If not, please tell me how to proceed...

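Following the suggestions from @K J and @johnwhitington, I ended up writing a BASH script based on a mix of pdffonts, qpdf, xxd and gs. The idea is to generate an editable PDF (qpdf), convert it to hexadecimal text (xxd), do the same with the font name patterns, perform a simple sed substitution, convert back to the qpdf format, and finally clean up the PDF with gs. I have only a vague understanding of what goes on behind the scenes, but hopefully someone can explain it in this thread.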
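The hexadecimal substitution at the heart of this idea can be tried on a small string before running it on whole files. The sketch below is illustrative only: the function names hex_of and patch_hex are made up, and od with tr is used in place of xxd -p -u purely so that the example also runs where xxd is not installed.

```bash
# Hypothetical helper: print the bytes of a string as uppercase hex, no spaces.
hex_of() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \t\n' | tr 'abcdef' 'ABCDEF'
}

# Hypothetical helper: read hex text on stdin and replace the hex spelling of
# a font name "with spaces" by the hex spelling of the same name "with dashes".
patch_hex() {
  local from to
  from=$(hex_of "$1")
  to=$(hex_of "$(printf '%s' "$1" | tr ' ' '-')")
  sed -e "s/$from/$to/g"
}
```

For instance, `hex_of 'Times New Roman'` gives 54696D6573204E657720526F6D616E, and patch_hex turns each 20 inside that pattern into 2D. Because a space and a dash are both one byte, the patched file keeps its length. One thing to keep in mind is that a substitution on raw hex text does not know about byte boundaries, so in principle a pattern could match starting at the second digit of a byte; with patterns as long as a font name this seems unlikely.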

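The code shown below answers my question 1. As for question 2, I have no alternative other than running the same script every time I do OCR in Adobe Acrobat Pro. An alternative to AA would be great, but I have not seen many (and I do not have the time to train Tesseract myself). So far the script has been tested on about 1000 files from about 100 different creators and appears to be stable, except for one known bug: mixed use of literal space ' ' and hexadecimal space #20 within the same PDF (which I have not observed so far).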

#!/bin/bash
#script tested on GNU bash, version 5.2.21(1)
#
# ------------------------------------------------------------------------------
# Purpose: Use output from "pdffonts" to patch up any font names "with spaces", 
#          if present in the PDF file. Such files have been observed to crash the 
#          (Postscript) printer when printed from Mac Preview while working OK 
#          when printed from Adobe Acrobat. The spaces can be encoded either as 
#          literal ' ' or as hexadecimal #20 (but not mixed usage). The two 
#          forms will be replaced by '-' or #2D respectively, in the patched 
#          pdf file which is generated by Ghostscript. No further changes are 
#          made to PDF, but other hidden issues/warnings/errors with the fonts 
#          may also come to light using "pdffonts", so watch out for any extra 
#          output from stderr. 
#
#          The heuristic of the font name patch works along these lines:
#
#          1. Use pdffonts to make list of font names "with spaces"
#          2. Run qpdf on PDF to make qpdf file format
#          3. Run xxd on 2. to make hexadecimal text without (extra) newlines
#          4. Run xxd on 1. to transform into hexadecimal search patterns
#          5. Use sed to substitute patterns 4. in file 3.
#          6. Run xxd to transform 5. back to qpdf format
#          7. Run gs on 6. to make patched pdf 
#
# Names  : SRCDIR (source directory for the PDFs)
#          OUTDIR (output work directory)
#          PDF    (running pdf filename, traversed in depth by "find")
#          JOB    (stem of output filenames in OUTDIR, see next item below)
#          FONTS  (unique list of font names "with spaces")
#          PAT    (element in FONTS, iterator)
#          PATx   (transformed hexadecimal PATs for specific use x, see code)
#
# Output : After a successful font name patch there will be a number of output
#          files stored in OUTDIR:
#
#          JOB.qdf.pdf     (output from qpdf)
#          JOB.xxd.qdf.pdf (output from xxd)
#          JOB.xxd.txt     (same as JOB.xxd.qdf.pdf, but without newlines)
#          JOB.gs.qdf.pdf  (output from gs, the patched PDF)
#          JOB.log         (log-file)
#
# Author : Tore Haug-Warberg
# Since  : 2024-01-17
# Note   : The regex used to isolate the font names is tailored to pdffonts 
#          v3.03 which states there are seven different kinds of fonts:
#          - Type 1
#          - Type 1C - aka Compact Font Format (CFF)
#          - Type 3
#          - TrueType
#          - CID Type 0 - 16-bit font with no specified type
#          - CID Type 0C - 16-bit PostScript CFF font
#          - CID TrueType - 16-bit TrueType font
#          This info is encoded as (CID)?[ ](True|Type).*$ in the code below
# Usage  : Change file destinations SRCDIR, OUTDIR; change maybe the command 
#          'find -s "$SRCDIR" -iname ...' to your needs; run script
# ------------------------------------------------------------------------------

SRCDIR=~/Foo/
OUTDIR=~/Bar/

find -s "$SRCDIR" -iname "*.pdf" -not -iname "*_orig.pdf" | \
while IFS= read -r PDF; 
do 
  JOB="$OUTDIR/$(basename "$PDF" '.pdf')"; 
  echo "$PDF"; 

  # Scan output from pdffonts looking for font names "with spaces". There is no
  # grammar for this and the sed pattern used below will sometimes fail: Only
  # the part of the font name which consists of alphanumeric text (plus space)
  # is recognized by sed | sort | uniq (for speeding up the text processing).
  # However, if the sed pattern fails all font names containing spaces will
  # still undergo space substitution, it just takes more time
  IFS=$'\n' \
  FONTS=$(pdffonts "$PDF" | \
          sed -n '3,$p' | \
          sed -E 's/^(.*)([ ]+CID[ ]+(True|Type).*$)/\1/' | \
          sed -E 's/^(.*)([ ]+(True|Type).*$)/\1/' | \
          sed -e 's/[ ]*$//' | \
          grep -e '[ ]' | \
          sed -E 's/^[^[:alnum:]]*([[:alnum:]]+[ ][[:alnum:]\ ]+).*$/\1/' | \
          sort | uniq);

  # Test that there are no font names "with spaces" in the list
  if [[ "" == "$FONTS" ]];
  then
    continue
  else
    echo "$PDF" > "$JOB".log; 
    echo "$FONTS" >> "$JOB".log; 
  fi

  # Transform PDF, first to qpdf-format and then to hexadecimal text with no
  # (extra) newlines, so that we can run 'sed' on the entire shebang without
  # knowing the qpdf file structure. This only works for PDFs of modest size,
  # but at least a few tens of MB works fine
  qpdf --qdf "$PDF" "$JOB".qdf.pdf;
  xxd -p -u "$JOB".qdf.pdf | tr -d ' \n' > "$JOB".xxd.txt;

  # Transform PAT into hexadecimal search patterns PATa and PATb (for font names
  # spelled with literal space ' ') and patterns PATc and PATd (for font names
  # spelled with hexadecimal #20). Both alternatives must be tested because
  # pdffonts outputs literal space even if #20 is used in PDF. The simple test
  # therefore only works if there is no mixed use of ' ' and #20 in PDF. With
  # mixed use it will fail to replace anything at all.
  for PAT in $FONTS;    
  do 
    PATa=$(echo "$PAT"                        | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATb=$(echo "$PAT" | sed -e 's/[ ]/-/g'   | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATc=$(echo "$PAT" | sed -e 's/[ ]/#20/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATd=$(echo "$PAT" | sed -e 's/[ ]/#2D/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    sed -i '' -e "s/$PATa/$PATb/g" "$JOB".xxd.txt;
    sed -i '' -e "s/$PATc/$PATd/g" "$JOB".xxd.txt;
  done;

  # Transform from hexadecimal back to qpdf format
  xxd -ps -r "$JOB".xxd.txt "$JOB".xxd.qdf.pdf;

  # Create a new pdf with the same timestamp as the PDF but now patched so that 
  # there are no font names "with spaces"
  gs -o "$JOB".gs.qdf.pdf \
     -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/default \
     "$JOB".xxd.qdf.pdf >> "$JOB".log;
  touch -r "$PDF" "$JOB".gs.qdf.pdf;

  # Log result
  pdffonts "$JOB".gs.qdf.pdf >> "$JOB".log;
  echo '--- qpdf+xxd+gs (end)'; 
done
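
A note on the two spellings handled by the script: in PDF syntax a name is introduced by a slash, and a # followed by two hexadecimal digits stands for the byte with that code, so #20 is a space and #2D is a dash. That is why pdffonts can show a literal space for a name that is spelled with #20 inside the file. The sketch below only illustrates the mapping for spaces; the function names are hypothetical and other #xx escapes are deliberately left alone.

```bash
# Hypothetical helper: show a PDF name token the way it is displayed,
# handling only the #20 escape (space).
pdf_name_to_display() {
  printf '%s\n' "${1#/}" | sed -e 's/#20/ /g'
}

# Inverse direction: spell a displayed font name as a PDF name token.
display_to_pdf_name() {
  printf '/%s\n' "$1" | sed -e 's/ /#20/g'
}
```

For example, `pdf_name_to_display '/AAAAAC+Times#20New#20Roman'` prints AAAAAC+Times New Roman, which is the form the search patterns in the script start from.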