PDF.js - How can I get information for each character instead of for each text chunk?

Question

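With PDF.js I can get information for each text chunk, as shown below, but some space-separated characters are incorrectly treated as a single chunk. The PDF I want to process uses a monospaced font, so it would be convenient to get information for each character, which would let me detect spaces by computing their coordinates. Is there a parameter or method that does this?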

const base64 = "data:application/pdf;base64,******";
const loadingTask = pdfjsLib.getDocument({ data: atob(base64.replace(/^.*,/, '')) });
const pdfDocument = await loadingTask.promise;
const page = await pdfDocument.getPage(1);
const textContent = await page.getTextContent();
console.log(textContent.items);
// [
//   {
//     "str": "Sample",
//     "dir": "ltr",
//     "width": 78.39999999999999,
//     "height": 9.8,
//     "transform": [
//         9.8,
//         0,
//         0,
//         9.8,
//         58.8,
//         721.0800000000004
//     ],
//     "fontName": "g_d0_f1",
//     "hasEOL": false
//   },
//   ...
// ]

Answer 1

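Yes, this is possible, but it has to be implemented in the source code of pdf.worker.js.

In the method "buildTextContentItem" you can access every character of the text chunk, including its unicode value and transform. You can then push them into an array and return that array as part of the text item.

Example: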

textChunk.charsArray.push({
  char: glyph.fontChar,
  unicode: glyph.unicode,
  transform: textChunk.prevTransform,
  str: textChunk.str.join("")
})

Add the code above at the very end of that method. Tested with version 3.8.162.
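
Once per-character entries are available, the space detection the question asks about could look roughly like the sketch below. It assumes each entry has the shape pushed in the snippet above and that transform[4] holds the x coordinate of that glyph. That is an assumption: whether textChunk.prevTransform is updated per glyph should be checked against the PDF.js source. The names findGaps, charWidth and tolerance are hypothetical and not part of PDF.js.

```javascript
// Hypothetical helper: reports where the horizontal distance between two
// consecutive characters is larger than one monospaced character cell.
// Assumes horizontal, left-to-right text and that transform[4] is the
// x coordinate of each glyph (an assumption, not a documented guarantee).
function findGaps(chars, charWidth, tolerance = 0.5) {
  const gaps = [];
  for (let i = 1; i < chars.length; i++) {
    const prevX = chars[i - 1].transform[4];
    const currX = chars[i].transform[4];
    const distance = currX - prevX;
    // In a monospaced font, adjacent glyphs are one charWidth apart.
    if (distance > charWidth * (1 + tolerance)) {
      // Number of empty cells between the two glyphs.
      const missing = Math.round(distance / charWidth) - 1;
      gaps.push({ afterIndex: i - 1, missing });
    }
  }
  return gaps;
}

// Made-up data: "A" and "B" are adjacent, then two empty cells, then "C".
const sampleChars = [
  { char: "A", transform: [10, 0, 0, 10, 100, 700] },
  { char: "B", transform: [10, 0, 0, 10, 106, 700] },
  { char: "C", transform: [10, 0, 0, 10, 124, 700] },
];
const gaps = findGaps(sampleChars, 6);
console.log(gaps); // [ { afterIndex: 1, missing: 2 } ]
```

The tolerance parameter is there because coordinates extracted from a PDF are floating-point values and rarely line up exactly on the character grid.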

Answer 2

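You can use the getTextContent() method in PDF.js to extract the text content of a PDF page as a stream of characters rather than text chunks. You can then iterate over the characters and perform any processing you need, such as extracting information about each character. Here is an example: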

// Load the PDF document
pdfjsLib.getDocument(pdfUrl).promise.then(function (pdfDoc) {
  // Get the first page
  pdfDoc.getPage(1).then(function (page) {
    // Get the text content as a stream of characters
    page.getTextContent().then(function (textContent) {
      // Iterate over the characters
      for (var i = 0; i < textContent.items.length; i++) {
        var char = textContent.items[i];
        // Perform any desired processing on the character
        console.log(char.str, char.width, char.height);
      }
    });
  });
});
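
As the output in the question shows, a single entry in textContent.items can hold several characters. For a monospaced font, per-character positions can be approximated from the item itself without patching the worker. The sketch below is only an approximation. It assumes horizontal, unrotated, left-to-right text, that transform[4] and transform[5] are the x and y origin of the item, and that width covers the whole string including any spaces in str. The helper name splitItemIntoChars is hypothetical and not a PDF.js API.

```javascript
// Hypothetical helper: splits one text item into per-character boxes by
// dividing the item's width evenly, which only holds for monospaced fonts.
function splitItemIntoChars(item) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  const chars = Array.from(item.str);
  if (chars.length === 0) return [];
  const charWidth = item.width / chars.length;
  const x0 = item.transform[4]; // x origin of the item (assumed)
  const y0 = item.transform[5]; // y origin (baseline) of the item (assumed)
  return chars.map((char, i) => ({
    char,
    x: x0 + i * charWidth,
    y: y0,
    width: charWidth,
    height: item.height,
  }));
}

// Made-up item in the same shape as the entries of textContent.items.
const sampleItem = {
  str: "AB CD",
  width: 30,
  height: 10,
  transform: [10, 0, 0, 10, 100, 700],
};
const perChar = splitItemIntoChars(sampleItem);
console.log(perChar.map((c) => `${JSON.stringify(c.char)} @ x=${c.x}`));
```

With real data this would be called once per entry, for example textContent.items.flatMap(splitItemIntoChars). If the number of characters in str does not match the number of cells the width actually spans, the evenly divided positions will drift, so the result is worth checking against a known page first.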
