Unstructured partition_pdf cannot find Tesseract


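I am trying to use the partition_pdf function from Unstructured with strategy="hi_res" on a Windows machine. The function keeps failing because it cannot find the path to the Tesseract executable. If I use pytesseract directly I can set this path, but I don't know how to set it in Unstructured. Does anyone know how to solve this? The system is Windows 10.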

Full command:

from unstructured.partition.pdf import partition_pdf

fname = "test_file.pdf"

elements = partition_pdf(filename=fname, strategy="hi_res")

Error:

...
File ~\AppData\Local\anaconda3\envs\unstructured\lib\site-packages\unstructured_pytesseract\pytesseract.py:453, in get_tesseract_version()
    446     output = subprocess.check_output(
    447         [tesseract_cmd, '--version'],
    448         stderr=subprocess.STDOUT,
    449         env=environ,
    450         stdin=subprocess.DEVNULL,
    451     )
    452 except OSError:
--> 453     raise TesseractNotFoundError()
    455 raw_version = output.decode(DEFAULT_ENCODING)
    456 str_version, *_ = raw_version.lstrip(string.printable[10:]).partition(' ')

TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
python pdf tesseract langchain
1 Answer

You can set the Tesseract path directly in the unstructured library code that calls it. The file you are looking for is unstructured/partition/utils/ocr_models/tesseract_ocr.py. In this case, the full path would be
~/AppData/Local/anaconda3/envs/unstructured/lib/site-packages/unstructured/partition/utils/ocr_models/tesseract_ocr.py

One place to add it is just before the class OCRAgentTesseract(OCRAgent), so the end result looks like this:

from unstructured.logger import logger
from unstructured.partition.utils.config import env_config
from unstructured.partition.utils.constants import (
    IMAGE_COLOR_DEPTH,
    TESSERACT_MAX_SIZE,
    TESSERACT_TEXT_HEIGHT,
    Source,
)
from unstructured.partition.utils.ocr_models.ocr_interface import OCRAgent
from unstructured.utils import requires_dependencies

if TYPE_CHECKING:
    from unstructured_inference.inference.elements import TextRegion
    from unstructured_inference.inference.layoutelement import (
        LayoutElement,
    )

unstructured_pytesseract.pytesseract.tesseract_cmd = r"<path_to_tesseract>"  # <- add this line

class OCRAgentTesseract(OCRAgent):
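
Editing a file under site-packages is lost whenever the package is reinstalled or upgraded, so it may be worth trying the same setting from your own script first. The sketch below is untested against any particular unstructured version. It assumes Tesseract is installed in C:\Program Files\Tesseract-OCR (adjust TESSERACT_DIR to your machine); prepend_to_path and partition_with_tesseract are hypothetical helper names, not part of unstructured. It puts the Tesseract directory on this process's PATH, then sets the same tesseract_cmd attribute shown above before calling partition_pdf.

```python
import os
import shutil

# Assumed install location; change this to the folder that contains tesseract.exe.
TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def prepend_to_path(directory: str) -> None:
    """Add `directory` to the front of this process's PATH unless it is already listed."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in entries:
        os.environ["PATH"] = os.pathsep.join([directory] + [e for e in entries if e])


def partition_with_tesseract(pdf_path: str, tesseract_dir: str = TESSERACT_DIR):
    """Call partition_pdf(strategy="hi_res") after making Tesseract discoverable."""
    prepend_to_path(tesseract_dir)
    tesseract_exe = shutil.which("tesseract")
    if tesseract_exe is None:
        raise FileNotFoundError(f"no tesseract executable found via {tesseract_dir}")

    # Imported here so the PATH change is already in place when these modules load.
    import unstructured_pytesseract
    from unstructured.partition.pdf import partition_pdf

    # Same module attribute the answer sets inside tesseract_ocr.py,
    # assigned from the calling script instead of the library source.
    unstructured_pytesseract.pytesseract.tesseract_cmd = tesseract_exe
    return partition_pdf(filename=pdf_path, strategy="hi_res")
```

With this in place, elements = partition_with_tesseract("test_file.pdf") would replace the direct partition_pdf call from the question. If it still raises TesseractNotFoundError, the edit to tesseract_ocr.py described above remains the fallback.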