Retrieve a list of fonts from any PDF using PHP

Question

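I am trying to retrieve a list of all fonts used in an uploaded PDF. The script below works for some PDFs, but when several fonts are listed on the same line of the PDF, it finds only the first font and misses the rest.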

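$box = "/BaseFont\ ?.*/";
$fonts = array();   // distinct font names found so far
$fontcount = 0;     // number of distinct fonts
$stream = new SplFileObject($pdffile);
while (!$stream->eof()) {
    // Search one line of the file at a time
    if (preg_match_all($box, $stream->fgets(), $matches)) {
        for ($i = 0; $i < count($matches[0]); $i++) {
            // Drop everything up to and including the "+" of the subset prefix
            $newfont = substr($matches[0][$i], (strpos($matches[0][$i], "+") + 1));
            if (strpos($newfont, "/") > 0) {
                // Keep only the text before the next "/", without spaces or line breaks
                $newfont = str_replace(' ', '', substr($newfont, 0, strpos($newfont, "/")));
                $newfont = str_replace(array("\r", "\n"), '', $newfont);
            }
            if (!in_array($newfont, $fonts)) {
                $fontcount = $fontcount + 1;
                echo $newfont . "<br>";
                array_push($fonts, $newfont);
            }
        }
    }
}
$stream = null;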

PDF sample:

<</BaseFont/CMYBYX+Wingdings-Regular/DescendantFonts 27 0 R/Encoding/Identity-H/Subtype/Type0/ToUnicode 28 0 R/Type/Font>>
endobj
14 0 obj
<</BaseFont/CMYBYX+Roboto-Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 30 0 R/LastChar 116/Subtype/TrueType/ToUnicode 31 0 R/Type/Font/Widths[249 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 873 708 0 651 0 0 634 0 0 636 0 0 0 0 0 0 0 0 0 0 0 0 518 0 529 0 564 564 267 0 0 267 0 564 564 0 0 0 516 349]>>
endobj
15 0 obj
<</BaseFont/CMYBYX+Roboto-Light/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 33 0 R/LastChar 127/Subtype/TrueType/ToUnicode 34 0 R/Type/Font/Widths[243 0 0 0 0 0 0 0 0 0 0 0 0 0 239 397 583 554 554 554 554 0 554 554 554 554 0 0 0 0 0 0 913 625 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 538 557 518 0 515 329 557 557 227 0 0 227 886 557 557 557 0 340 509 332 557 0 757 0 0 0 0 0 0 0 323]>>
endobj

In the actual file this is all on one line:

<</BaseFont/CMYBYX+Wingdings-Regular/DescendantFonts 27 0 R/Encoding/Identity-H/Subtype/Type0/ToUnicode 28 0 R/Type/Font>>endobj14 0 obj<</BaseFont/CMYBYX+Roboto-Bold/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 30 0 R/LastChar 116/Subtype/TrueType/ToUnicode 31 0 R/Type/Font/Widths[249 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 873 708 0 651 0 0 634 0 0 636 0 0 0 0 0 0 0 0 0 0 0 0 518 0 529 0 564 564 267 0 0 267 0 564 564 0 0 0 516 349]>>endobj15 0 obj<</BaseFont/CMYBYX+Roboto-Light/Encoding/WinAnsiEncoding/FirstChar 32/FontDescriptor 33 0 R/LastChar 127/Subtype/TrueType/ToUnicode 34 0 R/Type/Font/Widths[243 0 0 0 0 0 0 0 0 0 0 0 0 0 239 397 583 554 554 554 554 0 554 554 554 554 0 0 0 0 0 0 913 625 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 538 557 518 0 515 329 557 557 227 0 0 227 886 557 557 557 0 340 509 332 557 0 757 0 0 0 0 0 0 0 323]>>endobj

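My last resort would be to read the file line by line and search each line for fonts, but that would greatly increase processing time.

Does anyone have a suggestion? A different pattern (I have tried several)?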

Answer 1

Change the pattern to:

$box="/BaseFont\ ?.*>>/U";

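Explanation:

The most important change is the trailing U, which is a pattern modifier (see Pattern Modifiers). It makes the match ungreedy, so the match no longer runs to the end of the line, as the original pattern allows, but stops at the first >> it finds.
Speaking of >>, that is the second addition to your pattern. It is needed so that the match covers the whole BaseFont definition. Without it, the pattern would match only the word BaseFont, because the match is ungreedy.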
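
To see the difference, here is a minimal sketch that runs both patterns against one line holding three font objects. The sample line is a shortened version of the one in the question, and the variable names ($line, $greedyCount, $lazyCount) are made up for this illustration. The name extraction follows the question's approach, so it assumes every font name carries a subset prefix such as CMYBYX+.

```php
<?php
// Three font objects on one physical line (shortened from the question's sample).
$line = '<</BaseFont/CMYBYX+Wingdings-Regular/Subtype/Type0/Type/Font>>endobj'
      . '14 0 obj<</BaseFont/CMYBYX+Roboto-Bold/Subtype/TrueType/Type/Font>>endobj'
      . '15 0 obj<</BaseFont/CMYBYX+Roboto-Light/Subtype/TrueType/Type/Font>>endobj';

// Original pattern: .* is greedy and runs to the end of the line,
// so the whole line is consumed by a single match.
$greedyCount = preg_match_all('/BaseFont\ ?.*/', $line, $greedy);

// Pattern from the answer: with the U modifier each match stops at the first >>.
$lazyCount = preg_match_all('/BaseFont\ ?.*>>/U', $line, $lazy);

// Same extraction idea as in the question: drop everything up to the "+",
// then cut at the next "/". Assumes a subset prefix (e.g. "CMYBYX+") is present.
$fonts = array();
foreach ($lazy[0] as $match) {
    $name = substr($match, strpos($match, '+') + 1);
    $slash = strpos($name, '/');
    if ($slash !== false) {
        $name = substr($name, 0, $slash);
    }
    if (!in_array($name, $fonts, true)) {
        $fonts[] = $name;
    }
}

echo $greedyCount, "\n";           // 1
echo $lazyCount, "\n";             // 3
echo implode(', ', $fonts), "\n";  // Wingdings-Regular, Roboto-Bold, Roboto-Light
```

Fonts that are not subsetted have no "+" in their name, so real files would likely need an extra check before the substr() call.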
