MD5 checksum of a PDF file

Question

Please consider the following scenario.

1 - Compute the MD5 of a .txt file containing "Hello" (without quotes, length = 5). This gives some hash value (say h1).
2 - Now change the file contents to "Hello " (without quotes, length = 6). This gives some hash value (say h2).
3 - Now change the file back to "Hello" (exactly as in step 1). The hash is h1 again, which makes sense.
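
The three steps above can be reproduced in a few lines. This is only a sketch, assuming Node.js and its built-in `crypto` module; the helper name `md5Hex` is made up for illustration and is not the freeware tool used in the question.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded MD5 digest of a byte sequence.
function md5Hex(data: Buffer): string {
  return createHash("md5").update(data).digest("hex");
}

const h1 = md5Hex(Buffer.from("Hello", "utf8"));  // step 1: 5 bytes
const h2 = md5Hex(Buffer.from("Hello ", "utf8")); // step 2: 6 bytes
const h3 = md5Hex(Buffer.from("Hello", "utf8"));  // step 3: back to the 5 bytes of step 1

console.log(h1 === h3); // identical bytes give an identical hash
console.log(h1 !== h2); // one extra byte gives a different hash
```

The hash depends only on the bytes of the file, so the analogy holds for a PDF exactly when the saved file is byte-for-byte identical to the original.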

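Now, if the same procedure is applied to a .pdf file, a problem appears. Here I do not change the file content; I change the color of the text and then restore it to the original. Doing this, I get three different hash values.

So, are the hashes different because of the way the PDF reader encodes the text and metadata, or is the analogy itself wrong?

Info: the hashes were computed on Windows using a freeware tool.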

Answer 1

So, are the hashes different because of the way the PDF reader encodes the text and metadata, or is the analogy itself wrong?

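Correct. If you want to test this on your own data, open any PDF in a text editor (I use Notepad++) and scroll to the bottom (where the metadata is stored). You will see something like the following: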

<</Subject (Shipping Documents)
/CreationDate (D:20150630070941-06'00')
/Title (Shipping Documents)
/Author (SomeAuthor)
/Producer (iText by lowagie.com \(r0.99 - paulo118\))
/ModDate (D:20150630070941-06'00')
>>

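Clearly, at least /CreationDate and /ModDate will keep changing. Even if you regenerate the PDF from the same source data, these timestamps alone are enough to change the checksum of the resulting PDF.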
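
To see which timestamps a particular file carries without opening it in an editor, a small scan of the raw bytes can help. The sketch below is an illustration only: the name `findDateEntries` and the regular expression are assumptions, and it only finds entries stored uncompressed and unencrypted, as in the excerpt above.

```typescript
interface DateEntry {
  key: string;
  value: string;
}

// Hypothetical helper: list /CreationDate and /ModDate entries found in raw PDF bytes.
function findDateEntries(pdf: Buffer): DateEntry[] {
  // latin1 maps each byte to exactly one character, so binary data is not mangled.
  const text = pdf.toString("latin1");
  const pattern = /\/(CreationDate|ModDate)\s*\(([^)]*)\)/g;
  const entries: DateEntry[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    entries.push({ key: match[1], value: match[2] });
  }
  return entries;
}

// Sample input: a shortened copy of the dictionary shown above.
const sample = Buffer.from(
  "<</Subject (Shipping Documents)\n" +
    "/CreationDate (D:20150630070941-06'00')\n" +
    "/ModDate (D:20150630070941-06'00')\n" +
    ">>",
  "latin1"
);

const dateEntries = findDateEntries(sample);
console.log(dateEntries);
```

Running this on two saves of the same document and comparing the values is a quick way to confirm that the timestamps are what differs.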

Answer 2

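Correct. Because of metadata stored in the file (such as ModDate), PDFs that look exactly the same can have different checksums. I needed to detect PDFs that "look" the same, so I wrote a somewhat hacky JavaScript (TypeScript) function. It is not guaranteed to work, but at least it sometimes detects duplicates (a normal checksum rarely detects duplicate PDFs). You can read more about the PDF format here: https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/PDF32000_2008.pdf, and you can find some similar solutions in this related SO question: "Why does repeated bursting of a multi-page PDF into individual pages via pdftk change the md5 checksum of those pages?"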

    /**
     * The PDF format is weird, and contains various header information and other metadata.
     * Most (all?) actual pdf contents appear between keywords `stream` and `endstream`.
     * So, to ignore metadata, this function just extracts any contents between "stream" and "endstream".
     * This is not guaranteed to find _all_ contents, but it _should_ ignore all metadata.
     * Useful for generating checksums.
     */
    private getRawContent(buffer: Buffer): string {
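        // Decode as latin1 so that every byte maps to exactly one character;
        // the default UTF-8 decoding is lossy for the binary data inside PDF streams.
        const str = buffer.toString('latin1');
        // FIXME: If the binary stream itself happens to contain "endstream" or "ModDate", this won't work.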
        const streamParts = str.split('endstream').filter(x => !x.includes('ModDate'));
        if (streamParts.length === 0) {
            return str;
        }
        const rawContent: string[] = [];
        for (const streamPart of streamParts) {
            // Ignore everything before the first `stream`
            const streamMatchIndex = streamPart.indexOf('stream');
            if (streamMatchIndex >= 0) {
                const contentStartIndex = streamMatchIndex + 'stream'.length;
                const rawPartContent = streamPart.substring(contentStartIndex);
                rawContent.push(rawPartContent);
            }
        }
        return rawContent.join('\n');
    }
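
One way to use such a function is to hash only the extracted stream contents instead of the whole file. The sketch below is an assumption about how that might look, not the answer author's code: `extractStreams` is a compact standalone variant of the method above, and `contentChecksum`, `makePdf`, and the two tiny PDF-like buffers are made up for illustration.

```typescript
import { createHash } from "node:crypto";

// Hypothetical standalone variant of the method above: keep only the text
// between each `stream` / `endstream` pair, skipping parts that mention ModDate.
function extractStreams(pdf: Buffer): string {
  const text = pdf.toString("latin1"); // one character per byte
  const contents: string[] = [];
  for (const part of text.split("endstream")) {
    if (part.includes("ModDate")) continue;
    const start = part.indexOf("stream");
    if (start >= 0) contents.push(part.substring(start + "stream".length));
  }
  return contents.join("\n");
}

// Hypothetical helper: MD5 over the extracted stream contents only.
function contentChecksum(pdf: Buffer): string {
  return createHash("md5").update(extractStreams(pdf), "latin1").digest("hex");
}

// Two made-up, PDF-like buffers that differ only in the /ModDate value.
const makePdf = (modDate: string): Buffer =>
  Buffer.from(
    "1 0 obj\n<< /Length 5 >>\nstream\nHello\nendstream\nendobj\n" +
      `2 0 obj\n<< /ModDate (${modDate}) >>\nendobj\n`,
    "latin1"
  );

const pdfA = makePdf("D:20150630070941-06'00'");
const pdfB = makePdf("D:20150701080000-06'00'");

// Hashing the whole file sees the timestamp; hashing the streams does not.
const wholeFileDiffers =
  createHash("md5").update(pdfA).digest("hex") !==
  createHash("md5").update(pdfB).digest("hex");
const contentMatches = contentChecksum(pdfA) === contentChecksum(pdfB);

console.log(wholeFileDiffers, contentMatches);
```

The same caveats as in the FIXME apply: a stream whose bytes happen to contain `endstream` or `ModDate` will be split or skipped incorrectly, so this is a heuristic for spotting likely duplicates rather than a proof of equality.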
