PDF to DOCX converter using Tesseract-OCR


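I am trying to develop a converter that accurately extracts text, tables, alignment, and tab formatting from PDF files, including files written in Bangla (Bengali, as used in Bangladesh). The code I am using at the moment cannot handle Bangla text, so I am trying to incorporate Tesseract-OCR to improve the output. However, I am still not getting the output I need.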

Here is my code so far:

import os
from pathlib import Path
from tkinter import messagebox as mb

import pytesseract
from pdf2docx import Converter


current_path = os.path.dirname(os.path.abspath(__file__))

# Named input_pdf/output_dir so the built-in input() is not shadowed
input_pdf = os.path.join(current_path, "input.pdf")
output_dir = current_path


def pdf_to_docx_and_extract_text(path, save_path):
    """Convert PDF file to DOCX file and extract Bangla text using pytesseract"""

    # Case-insensitive extension check (.pdf, .PDF, ...)
    if Path(path).is_file() and Path(path).suffix.lower() == ".pdf":
        try:
            docx_path = os.path.join(save_path, Path(path).stem + ".docx")
            cv = Converter(path)
            cv.convert(docx_path, start=0, end=None)
            cv.close()

            # Extract Bangla text from the converted DOCX file.
            # NOTE: image_to_string expects an image (a PIL image or an image
            # file path). A DOCX file is not an image, so this call cannot
            # return its text.
            text = pytesseract.image_to_string(
                docx_path, lang="ben"
            )  # Specify 'ben' for Bangla language
            print("Extracted Bangla text:")
            print(text)

        except Exception as ex:
            # Report the actual exception instead of a fixed message
            mb.showerror("ERROR", f"Conversion failed: {ex}")
    else:
        mb.showerror("INFO", "File not found!")


pdf_to_docx_and_extract_text(input_pdf, output_dir)

I am looking for help improving it so that it extracts text, tables, alignment, and tab formatting from PDF files written in Bangla.
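For reference, one possible direction is to run Tesseract on rendered page images rather than on the DOCX file. The sketch below is only a starting point under several assumptions: that PyMuPDF (`fitz`) and python-docx are importable (pdf2docx appears to depend on both), that Pillow is available, and that the Tesseract `ben` language data is installed. The helper names `docx_path_for`, `bangla_available`, `ocr_pdf_pages`, and `save_pages_as_docx` are hypothetical and not part of any library.

```python
from pathlib import Path


def docx_path_for(pdf_path, save_dir):
    """Return the DOCX path that corresponds to a PDF, inside save_dir."""
    return Path(save_dir) / (Path(pdf_path).stem + ".docx")


def bangla_available():
    """Report whether the installed Tesseract lists the 'ben' language data."""
    import pytesseract  # lazy import, so docx_path_for works without it

    return "ben" in pytesseract.get_languages(config="")


def ocr_pdf_pages(pdf_path, lang="ben", dpi=300):
    """Render each PDF page to an image and OCR it; returns one string per page."""
    import io

    import fitz  # PyMuPDF (assumed installed alongside pdf2docx)
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Rasterize the page; a higher dpi usually helps OCR but is slower
            pix = page.get_pixmap(dpi=dpi)
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts


def save_pages_as_docx(page_texts, docx_path):
    """Write OCR text to a DOCX file, starting each PDF page on a new page."""
    from docx import Document  # python-docx

    document = Document()
    for index, text in enumerate(page_texts):
        if index:
            document.add_page_break()
        for line in text.splitlines():
            if line.strip():
                document.add_paragraph(line)
    document.save(str(docx_path))


# Example usage (assumes input.pdf exists and the 'ben' data is installed):
# if bangla_available():
#     pages = ocr_pdf_pages("input.pdf")
#     save_pages_as_docx(pages, docx_path_for("input.pdf", "."))
```

This recovers plain text only; tables, alignment, and tab stops are not reconstructed. pytesseract also offers `image_to_pdf_or_hocr`, which can emit a searchable PDF or hOCR with word positions, and that output might serve as a basis for layout recovery, though how well it preserves Bangla layout is not verified here.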

python python-tesseract