我正在寻找云端或 SDK 格式的 PDF 转 HTML 和 OCR 服务解决方案。经过搜索,我发现互联网上有很多服务。我尝试了其中一些并得到了一些想法。我想知道你们中是否有人使用此类服务。
我最关心的是拥有一个自动化结构来生成可用于信息提取的 HTML 输出。我想要像表格一样的结构化数据输出。 (大多数服务提供 HTML 输出,格式为 -character 格式(每个字符的 CSS/HTML 标记)或 -paragraph 格式(每行的 CSS/HTML)。
到目前为止我检查过:
提前致谢。
.html
文件中。
def pdf_to_images_to_hocr(pdf_file_path: list, images_dir_path: str, output_hocr_file_path: str) -> None:
"""
OCR on images using tesseract.
- Converts the PDF file to images.
- Uses tesseract to perform OCR on the images.
- Generates a hocr file.
- The hocr file is stored in the output_hocr_file_path.
- The images are stored temporarily in the images_dir_path.
- Ensure that the directory exists.
- The directory is cleared before and after the operation.
Args:
- pdf_file_path (str): Path to the PDF file.
- images_dir_path (str): Path to the directory where the images will be stored temporarily.
- output_hocr_file_path (str): Path to the hocr file.
Docs:
- https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
"""
# ClearTextBack/temp/images/* should exist.
os.system(f"rm -rf {images_dir_path}/*")
image_file_paths = convert_pdf_to_images(pdf_file_path, images_dir_path)
hocr_files = []
for image_file_name in image_file_paths:
hocr_file = image_file_name.replace(".jpeg", "")
os.system(f"tesseract {image_file_name} {hocr_file} -l eng hocr")
hocr_files.append(hocr_file)
with open(output_hocr_file_path, "w") as hocr_file:
for hocr_file_path in hocr_files:
hocr_file_path += ".hocr"
with open(hocr_file_path, "r") as f:
hocr_file.write(f.read())
os.system(f"rm -rf {images_dir_path}/*")
print(f">> HOCR file {output_hocr_file_path} generated successfully.")
这是它的辅助函数:
def convert_pdf_to_images(pdf_file_path: str, output_save_dir: str = '') -> list:
"""
Converts a PDF file to a list of images.
- Uses the pdf2image package to convert the PDF file to images.
- Returns a list of image file paths.
- If output_save_path is provided, the images are stored in the output_save_path.
- If output_save_path is not provided, are not stored.
- Either way, the list of image file paths is returned.
Args:
- pdf_file_path (str): Path to the PDF file.
- output_save_dir (str): Path to the directory where the images will be stored. Default is ''.
Docs:
- https://pypi.org/project/pdf2image/
Returns:
- image_files (list): List of image file paths.
"""
from pdf2image import convert_from_path
images = convert_from_path(pdf_file_path)
image_files = []
for i, image in enumerate(images):
image_file_path = f"{output_save_dir}/page_{i}.jpeg"
if output_save_dir:
image.save(image_file_path, "JPEG")
image_files.append(image_file_path)
print(">> PDF converted to images successfully.")
return image_files
不要忘记安装依赖项。