How do I extract text from a two-column PDF using Python?

Question

I have a PDF in a two-column format. Is there a way to read each PDF following the two-column layout without cropping each PDF individually?

Answer 1

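This is the code I use for general PDF parsing, and it seems to work correctly on that image (I downloaded the page as an image, so the code runs optical character recognition and is only as accurate as ordinary OCR). Note that this tokenizes the text. Also note that you need to install tesseract for this to work (pytesseract just lets you run tesseract from Python). Tesseract is free and open source.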

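from PIL import Image
import pytesseract
import cv2
import os

def parse(image_path, threshold=False, blur=False):
    """OCR an image and return its text as a list of whitespace-separated tokens."""
    image = cv2.imread(image_path)
    if image is None:  # cv2.imread returns None instead of raising
        raise FileNotFoundError(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    if threshold:
        # Otsu's method picks the binarization threshold automatically.
        gray = cv2.threshold(gray, 0, 255,
                             cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
    if blur:  # useful if salt-and-pepper background.
        gray = cv2.medianBlur(gray, 3)
    filename = "{}.png".format(os.getpid())
    cv2.imwrite(filename, gray)  # Create a temp file
    try:
        with Image.open(filename) as img:
            text = pytesseract.image_to_string(img)
    finally:
        os.remove(filename)  # Remove the temp file
    tokens = text.split()  # PROCESS HERE.
    print(tokens)
    return tokens

image_path = "page.png"  # placeholder: path to an image of the PDF page
a = parse(image_path, True, False)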
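The code above reads the page as a single block, so it does not by itself keep the two columns apart. One possible way to do that without cropping is to work from word positions instead of plain text. The sketch below is not part of the original answer. It assumes you already have one `(text, left, top, width)` tuple per word (pytesseract's `image_to_data` can supply such boxes), that the gutter sits at the horizontal centre of the page, and that words on the same line share the same `top` value. The function name `two_column_order` and the sample data are hypothetical.

```python
def two_column_order(words, page_width):
    """Return word texts in reading order: left column first, then right column.

    words: iterable of (text, left, top, width) tuples, one per word.
    page_width: width of the page image in the same units as the boxes.
    """
    middle = page_width / 2  # assumption: the column gutter is at the page centre
    left_col, right_col = [], []
    for text, left, top, width in words:
        # Assign each word to a column by the horizontal centre of its box.
        column = left_col if left + width / 2 < middle else right_col
        column.append((top, left, text))
    # Within a column, read top to bottom, then left to right.
    return ([text for _, _, text in sorted(left_col)] +
            [text for _, _, text in sorted(right_col)])


# Made-up boxes for illustration only.
sample_words = [
    ("world", 60, 10, 40),
    ("Hello", 10, 10, 40),
    ("Right", 310, 10, 40),
    ("second", 10, 30, 50),
    ("column", 360, 10, 50),
]
print(two_column_order(sample_words, page_width=600))
```

Real OCR output usually has small vertical jitter between words on the same line, so in practice the `top` values would need to be grouped into lines before sorting.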
