用于信息提取的 PDF 到 HTML 和 OCR 解决方案

我正在寻找云端或 SDK 格式的 PDF 转 HTML 和 OCR 服务解决方案。经过搜索，我发现互联网上有很多服务。我尝试了其中一些并得到了一些想法。我想知道你们中是否有人使用此类服务。

我最关心的是拥有一个自动化结构来生成可用于信息提取的 HTML 输出。我想要像表格一样的结构化数据输出。（大多数服务提供 HTML 输出，格式为 -character 格式（每个字符的 CSS/HTML 标记）或 -paragraph 格式（每行的 CSS/HTML）。

到目前为止我检查过：

Abbyy Cloud SDK（他们没有 PDF 到 HTML 服务，但 PDF 到 XML 可以通过 XSLT 支持转换为 HTML（也许）。此外，带有文本输出的 OCR 服务也相当不错）
cloudconvert.org（他们提供与基于 poppler-Xpdf3.0 的 Ubuntu pdftohtml 命令相同的结果）
pdftohtml commamd（在 Ubuntu 上测试） - 我得到的结果充满 < p >。
aspose.PDF（他们在云中没有 PDF 到 HTML 服务，但与 GDrive、Dropbox 和 Amazon s3 具有良好的集成。

我的问题是，您是否知道任何其他值得尝试并获取结构 HTML 输出以进行数据提取的服务。

提前致谢。

0
投票

也许不是理想的解决方案，但这可以完成一些简洁的 PDF 到 hOCR 转换。您也许可以将输出保存到

.html

 文件中。

def pdf_to_images_to_hocr(pdf_file_path: list, images_dir_path: str, output_hocr_file_path: str) -> None:
    """
    OCR on images using tesseract.
    - Converts the PDF file to images.
    - Uses tesseract to perform OCR on the images.
    - Generates a hocr file.
    - The hocr file is stored in the output_hocr_file_path.
    - The images are stored temporarily in the images_dir_path.
        - Ensure that the directory exists.
        - The directory is cleared before and after the operation.

    Args:
    - pdf_file_path (str): Path to the PDF file.
    - images_dir_path (str): Path to the directory where the images will be stored temporarily.
    - output_hocr_file_path (str): Path to the hocr file.

    Docs:
    - https://tesseract-ocr.github.io/tessdoc/Command-Line-Usage.html
    """
    # ClearTextBack/temp/images/* should exist.
    os.system(f"rm -rf {images_dir_path}/*")

    image_file_paths = convert_pdf_to_images(pdf_file_path, images_dir_path)
    hocr_files = []

    for image_file_name in image_file_paths:
        hocr_file = image_file_name.replace(".jpeg", "")
        os.system(f"tesseract {image_file_name} {hocr_file} -l eng hocr")
        hocr_files.append(hocr_file)

    with open(output_hocr_file_path, "w") as hocr_file:
        for hocr_file_path in hocr_files:
            hocr_file_path += ".hocr"
            with open(hocr_file_path, "r") as f:
                hocr_file.write(f.read())
    
    os.system(f"rm -rf {images_dir_path}/*")
    print(f">> HOCR file {output_hocr_file_path} generated successfully.")

这是它的辅助函数：

def convert_pdf_to_images(pdf_file_path: str, output_save_dir: str = '') -> list:
    """
    Converts a PDF file to a list of images.
    - Uses the pdf2image package to convert the PDF file to images.
    - Returns a list of image file paths.
    - If output_save_path is provided, the images are stored in the output_save_path.
    - If output_save_path is not provided, are not stored.
    - Either way, the list of image file paths is returned.

    Args:
    - pdf_file_path (str): Path to the PDF file.
    - output_save_dir (str): Path to the directory where the images will be stored. Default is ''.

    Docs:
    - https://pypi.org/project/pdf2image/

    Returns:
    - image_files (list): List of image file paths.
    """
    from pdf2image import convert_from_path

    images = convert_from_path(pdf_file_path)
    
    image_files = []
    for i, image in enumerate(images):
        image_file_path = f"{output_save_dir}/page_{i}.jpeg"
        if output_save_dir:
            image.save(image_file_path, "JPEG")
        image_files.append(image_file_path)

    print(">> PDF converted to images successfully.")
    return image_files

不要忘记安装依赖项。

问题描述投票：0回答：1

1个回答

最新问题

用于信息提取的 PDF 到 HTML 和 OCR 解决方案

问题描述 投票：0回答：1

1个回答

最新问题

问题描述投票：0回答：1