Extracting data from a scanned PDF using Python

Question. Votes: 1. Answers: 1

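I am extracting data from a scanned PDF with Tesseract OCR. The extraction works, but the accuracy is poor: in many places the output contains wrong characters. How can I get the data with 100% accuracy using Python?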

First I convert the PDF to JPG, then I use the pytesseract module to extract the text from the image.

from PIL import Image
import pytesseract

text=(pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg")))
text=repr(text)
text=text.replace(r"\n","")
print(text)

I expect to get the correct data from the PDF, but I get different characters: for example, z comes out as 2, 5 as s, 1 as I, and so on.
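The mix-ups listed above (z/2, 5/s, 1/I) are look-alike confusions between letters and digits. One possible workaround, assuming the affected fields are supposed to be purely numeric, is to post-correct only those tokens that already contain at least one digit. The names `LETTER_TO_DIGIT`, `fix_numeric_token` and `fix_numeric_fields` are hypothetical helpers for illustration, and the mapping covers only the pairs mentioned here:

```python
import re

# Assumed mapping, limited to the confusions described in the question.
LETTER_TO_DIGIT = str.maketrans({"z": "2", "Z": "2", "s": "5", "S": "5", "I": "1"})


def fix_numeric_token(token: str) -> str:
    """Replace look-alike letters with digits, but only in number-like tokens."""
    candidate = token.translate(LETTER_TO_DIGIT)
    has_digit = any(ch.isdigit() for ch in token)
    # Accept the correction only if the result is a plain number
    # (digits plus common separators); otherwise keep the token unchanged.
    if has_digit and re.fullmatch(r"[\d.,/-]+", candidate):
        return candidate
    return token


def fix_numeric_fields(text: str) -> str:
    """Apply fix_numeric_token to every whitespace-separated token."""
    return re.sub(r"\S+", lambda m: fix_numeric_token(m.group()), text)


print(fix_numeric_fields("Invoice 2s1I total 4z.s0"))
# prints: Invoice 2511 total 42.50
```

This is a heuristic, not a guarantee: a genuine code such as "s3" would be rewritten to "53", so it only makes sense for columns known to hold numbers.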

python-3.x ocr python-tesseract pdfminer pdf-extraction

1 Answer

0
votes

Hopefully the small changes below will help.

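from PIL import Image
import pytesseract

# Pass the language explicitly; image_to_string already returns a str.
text = pytesseract.image_to_string(Image.open(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg"), lang='eng')

# Remove the actual newline characters (no repr() round-trip needed).
text = text.replace("\n", "")

print(text)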
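If the output is still wrong after this, cleaning up the image before OCR often helps more than changing the call itself. Below is a minimal sketch, assuming Pillow is installed; `preprocess_for_ocr` and `ocr_image` are hypothetical helper names, and the scale factor, the threshold of 160 and the `--psm 6` page segmentation mode are assumptions you would need to tune for your scans. No OCR setup should be expected to reach 100% accuracy on scanned pages.

```python
from PIL import Image, ImageOps


def preprocess_for_ocr(img, scale=2, threshold=160):
    """Grayscale, upscale, stretch contrast and binarize an image for OCR."""
    gray = ImageOps.grayscale(img)
    # Upscaling can help when the scan resolution is low.
    width, height = gray.size
    gray = gray.resize((width * scale, height * scale), Image.LANCZOS)
    gray = ImageOps.autocontrast(gray)
    # Simple global threshold: pixels brighter than `threshold` become white.
    return gray.point(lambda p: 255 if p > threshold else 0)


def ocr_image(path, lang="eng", config="--psm 6"):
    """Run Tesseract on a preprocessed image and return the recognized text."""
    # Imported here so the preprocessing helper works without Tesseract installed.
    import pytesseract

    with Image.open(path) as img:
        cleaned = preprocess_for_ocr(img)
    return pytesseract.image_to_string(cleaned, lang=lang, config=config)


# Example call (requires the Tesseract binary and an existing image file):
# print(ocr_image(r"C:\Users\sumesh\Desktop\ip\ip\pdf11.jpg"))
```

It is also worth rendering the PDF pages at a higher resolution and saving them as PNG rather than JPG, since JPG compression artifacts around small glyphs can make the look-alike confusions worse.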
