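Some background: I maintain a largely unindexed archive of scientific literature, where searchable text is produced by scanning paper documents and then running OCR. This worked well until the university switched to printers without OCR capability. I then had to fall back on separate scanning and OCR, for which I chose Adobe Acrobat Pro. It seemed to work fine until one day I realized that I could not print some of the documents I had been working on (printing from Mac Preview, not from Adobe Acrobat). The error message from the printer (Ricoh IM C4500) was: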
ERROR: undefined
OFFENDING COMMAND: New
STACK:
/AAAAAC+*Times
/FontName
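My understanding of PDF is limited, but by first printing to PS (still from Preview) and then regenerating the PDF with Adobe Distiller, I was able to reproduce the error. That led me to replace all font names "with spaces" by font names "with dashes" (in the PS), like this: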
Times New Roman -> Times-New-Roman
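This made Adobe Distiller happy, and the regenerated PDF printed without problems. I then tried to do the same thing with pdf2ps and ps2pdf. Interestingly, these two programs working together fixed the problem without the manual intervention described above.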
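At this point I should show you an MWE, but I do not know how to make one, at least not with a PDF in the picture. Besides, I think the cause of the problem is already quite clear. The two questions are: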
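There are many files in the archive, and I see no feasible solution that does not use the command line. That is fine with me, e.g. running pdffonts (as described in "How to find out which fonts are referenced and which are embedded in a PDF document") to get the list of fonts used in a PDF. But if a PDF has font names with spaces, how do I go on to "edit" it? I assume the file needs to be rebuilt somehow, but here I really need some advice. To me, Ghostscript, for example, looks like the ideal tool for cleaning up PDF files at this level, but that may be naive.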
I hope I will be able to answer my own question. If not, please tell me how to proceed...
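As a starting point for the first question, here is a minimal sketch of how the affected font names could be picked out of pdffonts output. It assumes the usual pdffonts layout (two header lines, then one font per line, with a type column starting with "CID", "Type" or "TrueType"). The function name fonts_with_spaces and the sample text are made up for illustration; the sample is not taken from a real file.

```shell
#!/bin/bash
# Sketch: print the font names that contain a space, given pdffonts-style
# output on stdin. Assumes two header lines and a type column that starts
# with "CID", "Type " or "TrueType".
fonts_with_spaces() {
  sed -n '3,$p' |
    sed -E 's/[ ]+(CID[ ]+)?(TrueType|Type[ ]).*$//' |
    grep ' ' |
    sort -u
}

# Made-up sample in the pdffonts layout (hypothetical, for illustration only)
sample='name                         type         emb sub uni object ID
---------------------------- ------------ --- --- --- ---------
AAAAAC+Times New Roman       TrueType     yes yes no       8  0
Helvetica                    Type 1       no  no  no      12  0
AAAAAD+Arial Narrow          CID TrueType yes yes yes     15  0'

printf '%s\n' "$sample" | fonts_with_spaces
```

For a real file the same function would be fed from the tool itself, e.g. pdffonts file.pdf | fonts_with_spaces; an empty result means there is nothing to patch.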
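Following the suggestions from @K J and @johnwhitington, I ended up writing a Bash script based on a mix of pdffonts, qpdf, xxd and gs. The idea is to generate an editable PDF (qpdf), convert it to hexadecimal text (xxd), do the same with the font name patterns, perform a simple sed substitution, convert the result back to QDF format (xxd again), and finally clean up the PDF with gs. I have only a slight understanding of what happens behind the scenes, but hopefully someone can explain it in this thread.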
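The code shown below answers my question 1. As for question 2, I have no option other than running the same script every time I perform OCR in Adobe Acrobat Pro. An alternative to AA would be great, but I have not seen many (and I do not have time to train Tesseract myself). So far the script has been tested on about 1000 files from about 100 different creators and seems stable, except for one known bug: mixed use of the literal space ' ' and the hexadecimal space #20 in a PDF (which I have not observed so far).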
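Before the full script, the core idea (hex-encode the data, hex-encode the font name with and without the replacement, substitute with sed) can be seen in a small sketch. It works on a short made-up string instead of a real QDF file, and it uses od rather than xxd only so that the sketch depends on POSIX tools alone; the helper to_hex and the variable names are hypothetical.

```shell
#!/bin/bash
# Hex-encode stdin as one line of uppercase hex digits.
# od is a stand-in for "xxd -p -u" here; -v stops od from collapsing
# repeated lines into "*".
to_hex() {
  od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F'
}

sample='/FontName /AAAAAC+Times New Roman'   # made-up fragment, not a real QDF file
name='Times New Roman'

hex=$(printf '%s' "$sample" | to_hex)

# Search/replace patterns for the two spellings of a space in a font name
pat_a=$(printf '%s' "$name" | to_hex)                        # literal ' '
pat_b=$(printf '%s' "$name" | tr ' ' '-' | to_hex)           # ' '  -> '-'
pat_c=$(printf '%s' "$name" | sed 's/ /#20/g' | to_hex)      # '#20'
pat_d=$(printf '%s' "$name" | sed 's/ /#2D/g' | to_hex)      # '#20' -> '#2D'

# Both substitutions keep the byte length unchanged (20 -> 2D, 233230 -> 233244)
patched=$(printf '%s' "$hex" | sed -e "s/$pat_a/$pat_b/g" -e "s/$pat_c/$pat_d/g")

printf 'before: %s\n' "$hex"
printf 'after : %s\n' "$patched"
```

In the output the only difference between the two lines is that each 20 inside the font name has become 2D. In the real script the patched hex text is then turned back into bytes with xxd -r.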
#!/bin/bash
# Script tested on GNU bash, version 5.2.21(1)
#
# ------------------------------------------------------------------------------
# Purpose: Use output from "pdffonts" to patch up any font names "with spaces",
#          if present in the PDF file. Such files have been observed to crash
#          the (PostScript) printer when printed from Mac Preview while working
#          OK when printed from Adobe Acrobat. The spaces can be encoded either
#          as literal ' ' or as hexadecimal #20 (but not mixed usage). The two
#          forms will be replaced by '-' or #2D respectively in the patched
#          PDF file, which is generated by Ghostscript. No further changes are
#          made to the PDF, but other hidden issues/warnings/errors with the
#          fonts may also come to light using "pdffonts", so watch out for any
#          extra output on stderr.
#
#          The heuristic of the font name patch works along these lines:
#
#          1. Use pdffonts to make a list of font names "with spaces"
#          2. Run qpdf on the PDF to make a QDF-format file
#          3. Run xxd on 2. to make hexadecimal text without (extra) newlines
#          4. Run xxd on 1. to transform into hexadecimal search patterns
#          5. Use sed to substitute patterns 4. in file 3.
#          6. Run xxd to transform 5. back to QDF format
#          7. Run gs on 6. to make the patched PDF
#
# Names  : SRCDIR (source directory for the PDFs)
#          OUTDIR (output work directory)
#          PDF    (running pdf filename, traversed in depth by "find")
#          JOB    (stem of output filenames in OUTDIR, see next item below)
#          FONTS  (unique list of font names "with spaces")
#          PAT    (element in FONTS, iterator)
#          PATx   (transformed hexadecimal PATs for specific use x, see code)
#
# Output : After a successful font name patch there will be a number of output
#          files stored in OUTDIR:
#
#          JOB.qdf.pdf     (output from qpdf)
#          JOB.xxd.txt     (JOB.qdf.pdf as hexadecimal text without newlines,
#                           patched in place by sed)
#          JOB.xxd.qdf.pdf (output from xxd -r, the patched QDF file)
#          JOB.gs.qdf.pdf  (output from gs, the patched PDF)
#          JOB.log         (log-file)
#
# Author : Tore Haug-Warberg
# Since  : 2024-01-17
# Note   : The regex used to isolate the font names is tailored to pdffonts
#          v3.03, which reports seven different kinds of fonts:
#          - Type 1
#          - Type 1C      - aka Compact Font Format (CFF)
#          - Type 3
#          - TrueType
#          - CID Type 0   - 16-bit font with no specified type
#          - CID Type 0C  - 16-bit PostScript CFF font
#          - CID TrueType - 16-bit TrueType font
#          This info is encoded as (CID)?[ ](True|Type).*$ in the code below
# Usage  : Change file destinations SRCDIR, OUTDIR; change maybe the command
#          'find -s "$SRCDIR" -iname ...' to your needs; run script
# ------------------------------------------------------------------------------
SRCDIR=~/Foo/
OUTDIR=~/Bar/
# Note: 'find -s' (sorted traversal) is a BSD/macOS option
find -s "$SRCDIR" -iname "*.pdf" -not -iname "*_orig.pdf" | \
while IFS= read -r PDF;
do
  JOB="$OUTDIR/$(basename "$PDF" '.pdf')";
  echo "$PDF";
  # Scan output from pdffonts looking for font names "with spaces". There is no
  # grammar for this and the sed pattern used below will sometimes fail: only
  # the part of the font name which consists of alphanumeric text (plus space)
  # is recognized by sed | sort | uniq (for speeding up the text processing).
  # However, if the sed pattern fails, all font names containing spaces will
  # still undergo space substitution; it just takes more time.
  # IFS is set to newline so that the 'for' loop further down splits FONTS
  # line by line and not at the spaces inside a font name.
  IFS=$'\n'
  FONTS=$(pdffonts "$PDF" | \
    sed -n '3,$p' | \
    sed -E 's/^(.*)([ ]+CID[ ]+(True|Type).*$)/\1/' | \
    sed -E 's/^(.*)([ ]+(True|Type).*$)/\1/' | \
    sed -e 's/[ ]*$//' | \
    grep -e '[ ]' | \
    sed -E 's/^[^[:alnum:]]*([[:alnum:]]+[ ][[:alnum:]\ ]+).*$/\1/' | \
    sort | uniq);
  # Skip this PDF if there are no font names "with spaces" in the list
  if [[ "" == "$FONTS" ]];
  then
    continue
  else
    echo "$PDF" > "$JOB".log;
    echo "$FONTS" >> "$JOB".log;
  fi
  # Transform PDF, first to QDF format and then to hexadecimal text with no
  # (extra) newlines, so that we can run 'sed' on the entire file without
  # knowing the QDF file structure. This only works for PDFs of modest size,
  # but at least a few tens of MB works fine
  qpdf --qdf "$PDF" "$JOB".qdf.pdf;
  xxd -p -u "$JOB".qdf.pdf | tr -d ' \n' > "$JOB".xxd.txt;
  # Transform PAT into hexadecimal search patterns PATa and PATb (for font names
  # spelled with literal space ' ') and patterns PATc and PATd (for font names
  # spelled with hexadecimal #20). Both alternatives must be tested because
  # pdffonts outputs literal space even if #20 is used in the PDF. This simple
  # test therefore only works if there is no mixed use of ' ' and #20 in the
  # PDF; with mixed use it will fail to replace anything at all.
  # Note: "sed -i ''" is the BSD/macOS form; GNU sed takes -i without the ''.
  for PAT in $FONTS;
  do
    PATa=$(echo "$PAT" | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATb=$(echo "$PAT" | sed -e 's/[ ]/-/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATc=$(echo "$PAT" | sed -e 's/[ ]/#20/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    PATd=$(echo "$PAT" | sed -e 's/[ ]/#2D/g' | tr -d '\n' | xxd -p -u | tr -d ' \n');
    sed -i '' -e "s/$PATa/$PATb/g" "$JOB".xxd.txt;
    sed -i '' -e "s/$PATc/$PATd/g" "$JOB".xxd.txt;
  done;
  # Transform from hexadecimal back to QDF format
  xxd -ps -r "$JOB".xxd.txt "$JOB".xxd.qdf.pdf;
  # Create a new PDF with the same timestamp as the original PDF, but now
  # patched so that there are no font names "with spaces"
  gs -o "$JOB".gs.qdf.pdf \
     -sDEVICE=pdfwrite \
     -dPDFSETTINGS=/default \
     "$JOB".xxd.qdf.pdf >> "$JOB".log;
  touch -r "$PDF" "$JOB".gs.qdf.pdf;
  # Log result
  pdffonts "$JOB".gs.qdf.pdf >> "$JOB".log;
  echo '--- qpdf+xxd+gs (end)';
done
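The loop above reads file names line by line from find, which can go wrong for unusual file names (for instance names containing a newline). A possible variation, shown here only as a sketch, lets find hand the names directly to a small shell via -exec, so no line-based parsing is involved. The function name list_pdfs and the temporary test files are made up for illustration, and this variant drops the BSD-only -s option, so the order is whatever find returns.

```shell
#!/bin/bash
# Sketch: visit every PDF below directory $1, skipping *_orig.pdf, without
# parsing find output line by line. Here each file's basename is printed;
# in a real run the body of the inner loop would do the patching.
list_pdfs() {
  find "$1" -type f -iname '*.pdf' ! -iname '*_orig.pdf' \
    -exec sh -c 'for pdf in "$@"; do printf "%s\n" "${pdf##*/}"; done' sh {} +
}

# Demonstration on throw-away files (hypothetical names)
demo_dir=$(mktemp -d)
touch "$demo_dir/a b.pdf" "$demo_dir/c_orig.pdf" "$demo_dir/e.PDF" "$demo_dir/f.txt"
list_pdfs "$demo_dir" | sort
rm -rf "$demo_dir"
```

Only "a b.pdf" and "e.PDF" are visited in this demonstration; the *_orig.pdf backup and the non-PDF file are skipped, matching the filter used in the script.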