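I am trying to build a converter that accurately extracts text, tables, alignment, and tab formatting from PDF files, including files written in Bangla (the language of Bangladesh). My current code does not handle Bangla text,
so I am trying to integrate Tesseract OCR to improve the output. However, I still cannot get the output I need.
Here is my code so far: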
import os
from pathlib import Path
from tkinter import messagebox as mb

import fitz  # PyMuPDF, which pdf2docx already depends on
import pytesseract
from pdf2docx import Converter
from PIL import Image

current_path = os.path.dirname(os.path.abspath(__file__))
input_pdf = os.path.join(current_path, "input.pdf")  # avoid shadowing the built-in input()
output_dir = current_path


def pdf_to_docx_and_extract_text(path, save_path):
    """Convert a PDF file to DOCX and OCR its pages for Bangla text using pytesseract."""
    if Path(path).is_file() and Path(path).suffix.lower() == ".pdf":
        try:
            docx_path = os.path.join(save_path, Path(path).stem + ".docx")
            cv = Converter(path)
            cv.convert(docx_path, start=0, end=None)
            cv.close()
            # pytesseract reads images, not DOCX files, so render each PDF
            # page to an image and run OCR on that instead.
            print("Extracted Bangla text:")
            with fitz.open(path) as doc:
                for page in doc:
                    # Render at 3x zoom (about 216 dpi) so small glyphs stay legible
                    pix = page.get_pixmap(matrix=fitz.Matrix(3, 3))
                    image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                    # 'ben' is the Tesseract language code for Bangla
                    print(pytesseract.image_to_string(image, lang="ben"))
        except Exception as ex:
            # Show the real error instead of a fixed, misleading message
            mb.showerror("ERROR", f"Conversion or OCR failed: {ex}")
    else:
        mb.showerror("INFO", "File not found!")


pdf_to_docx_and_extract_text(input_pdf, output_dir)
I am looking for help improving this so that it extracts text, tables, alignment, and tab formatting from PDF files written in Bangla.
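For the alignment part, one possible starting point is to work from Tesseract's word boxes rather than plain text. The sketch below is only an illustration under assumptions, not a tested solution: it assumes the input dictionary has the shape returned by `pytesseract.image_to_data(image, lang="ben", output_type=pytesseract.Output.DICT)`, and the helper names `group_words_into_lines` and `guess_alignment`, as well as the 3% tolerance, are hypothetical choices made for this example.

```python
from collections import defaultdict


def group_words_into_lines(data, min_conf=0):
    """Group OCR word boxes by (block, paragraph, line).

    `data` is assumed to look like pytesseract's image_to_data DICT output,
    with parallel lists under "text", "conf", "left", "width", and so on.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # Skip empty cells and boxes below the confidence threshold
        if not word.strip() or float(data["conf"][i]) < min_conf:
            continue
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        left = data["left"][i]
        lines[key].append((left, left + data["width"][i], word))

    result = []
    for key in sorted(lines):
        words = sorted(lines[key])  # left-to-right order
        result.append({
            "text": " ".join(w for _, _, w in words),
            "left": words[0][0],
            "right": max(r for _, r, _ in words),
        })
    return result


def guess_alignment(line, area_left, area_right, tolerance=0.03):
    """Classify a line by comparing its gaps to the text area edges."""
    tol = (area_right - area_left) * tolerance
    left_gap = line["left"] - area_left
    right_gap = area_right - line["right"]
    if left_gap <= tol and right_gap <= tol:
        return "justify"
    if left_gap <= tol:
        return "left"
    if right_gap <= tol:
        return "right"
    if abs(left_gap - right_gap) <= tol:
        return "center"
    return "left"  # indented line: fall back to left (hypothetical choice)
```

In use, `area_left` and `area_right` could be taken as the smallest `left` and largest `right` over all lines on a page, and the resulting label could then be applied to the matching paragraph in the DOCX. Tab stops and tables would need a similar pass over the horizontal gaps between words, which this sketch does not attempt.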