Extracting text from a scanned PDF (images) using Python PyPDF2 [closed]

Question

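I have been trying to extract text from a scanned PDF (pages that are images, with no selectable text).

However, the output I get is not human-readable.

I want to extract information such as the date and invoice number from the PDF at this link (https://drive.google.com/file/d/1qQsqhlSKTZs-hlswrV8PIirR36896KXZ/view).

Please help me extract this information and store it as plain text.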

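import PyPDF2
from PIL import Image

# Legacy PyPDF2 (< 3.0) API. The second positional argument of PdfFileReader
# is `strict`, not a file mode, so the stray 'rb' has been dropped.
pdf_reader = PyPDF2.PdfFileReader(r'document.pdf')
page = pdf_reader.getPage(85)  # zero-based index, i.e. the 86th page
if '/XObject' in page['/Resources']:
    xobject = page['/Resources']['/XObject'].getObject()
    for obj in xobject:
        if xobject[obj]['/Subtype'] == '/Image':
            size = (xobject[obj]['/Width'], xobject[obj]['/Height'])
            # Raw, still-encoded image bytes: this is pixel data, not text,
            # which is why the printed output is unreadable.
            data = xobject[obj]._data
            print("*******", data)
            print(xobject[obj]['/Filter'])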

Answer 1

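[Update] I don't think PyPDF2 can read text from images. To turn images into text, I suggest using an OCR tool such as PyTesseract. Below is an example that uses pdf2image and PyTesseract to do what you want (you need to install PyTesseract, the Tesseract engine itself, and pdf2image correctly first).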

import pdf2image
import pytesseract
from pytesseract import Output, TesseractError

pdf_path = "document.pdf"

images = pdf2image.convert_from_path(pdf_path)

pil_im = images[0] # assuming that we're interested in the first page only

ocr_dict = pytesseract.image_to_data(pil_im, lang='eng', output_type=Output.DICT)
# ocr_dict now holds all the OCR info including text and location on the image

text = " ".join(ocr_dict['text'])
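
Once you have ocr_dict, you still need to pull the date and invoice number out of it. Here is a rough sketch of one way to do that. It assumes the dict has the usual image_to_data keys (page_num, block_num, par_num, line_num, text). The helper names and both regular expressions are my own guesses, since I can't see your invoice layout, so adjust the patterns to match your documents. The sample_ocr dict is made-up data standing in for real OCR output.

```python
import re
from collections import defaultdict

# Hypothetical patterns: tune these to the actual invoice layout.
DATE_RE = re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b")
INVOICE_RE = re.compile(
    r"invoice\s*(?:no\.?|number|#)?\s*[:#]?\s*([A-Z0-9/-]*\d[A-Z0-9/-]*)",
    re.IGNORECASE,
)


def ocr_dict_to_text(ocr_dict):
    """Rebuild line-by-line text from an image_to_data DICT result."""
    lines = defaultdict(list)
    rows = zip(
        ocr_dict["page_num"],
        ocr_dict["block_num"],
        ocr_dict["par_num"],
        ocr_dict["line_num"],
        ocr_dict["text"],
    )
    for page, block, par, line, word in rows:
        if word.strip():  # skip the empty entries emitted for non-word boxes
            lines[(page, block, par, line)].append(word)
    # Keys sort in reading order: page, block, paragraph, line.
    return "\n".join(" ".join(words) for _, words in sorted(lines.items()))


def extract_invoice_fields(text):
    """Return the first date and invoice number found, or None for each."""
    date_match = DATE_RE.search(text)
    invoice_match = INVOICE_RE.search(text)
    return {
        "date": date_match.group(0) if date_match else None,
        "invoice_number": invoice_match.group(1) if invoice_match else None,
    }


# Made-up stand-in for the dict returned by pytesseract.image_to_data.
sample_ocr = {
    "page_num": [1, 1, 1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 1, 0, 2, 2],
    "text": ["", "Invoice", "No:", "INV-1023", "", "Date:", "12/05/2020"],
}

fields = extract_invoice_fields(ocr_dict_to_text(sample_ocr))
print(fields)
```

Grouping words by line number keeps labels such as "Invoice No:" next to their values, which a plain " ".join over every word does not guarantee. To store the result as plain text, you can write the string returned by ocr_dict_to_text to a file with the built-in open().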
