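I'm trying to use the partition_pdf function from unstructured with strategy="hi_res" on a Windows machine. The function keeps failing because it can't find the path to the Tesseract executable. When I use pytesseract directly I can set this path, but I don't know how to set it in unstructured. Does anyone know how to solve this? The system is Windows 10.
Full command: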
from unstructured.partition.pdf import partition_pdf
fname = "test_file.pdf"
elements = partition_pdf(filename=fname, strategy="hi_res")
Error:
...
File ~\AppData\Local\anaconda3\envs\unstructured\lib\site-packages\unstructured_pytesseract\pytesseract.py:453, in get_tesseract_version()
446 output = subprocess.check_output(
447 [tesseract_cmd, '--version'],
448 stderr=subprocess.STDOUT,
449 env=environ,
450 stdin=subprocess.DEVNULL,
451 )
452 except OSError:
--> 453 raise TesseractNotFoundError()
455 raw_version = output.decode(DEFAULT_ENCODING)
456 str_version, *_ = raw_version.lstrip(string.printable[10:]).partition(' ')
TesseractNotFoundError: tesseract is not installed or it's not in your PATH. See README file for more information.
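You can set the Tesseract path inside the unstructured library code that makes the call. The file you are looking for is unstructured/partition/utils/ocr_models/tesseract_ocr.py. In this case, the full path would be ~/AppData/Local/anaconda3/envs/unstructured/lib/site-packages/unstructured/partition/utils/ocr_models/tesseract_ocr.py
One place to add it is right before the class OCRAgentTesseract(OCRAgent), so the end result looks like this: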
# Excerpt from tesseract_ocr.py; the imports above this point are unchanged.
from unstructured.logger import logger
from unstructured.partition.utils.config import env_config
from unstructured.partition.utils.constants import (
    IMAGE_COLOR_DEPTH,
    TESSERACT_MAX_SIZE,
    TESSERACT_TEXT_HEIGHT,
    Source,
)
from unstructured.partition.utils.ocr_models.ocr_interface import OCRAgent
from unstructured.utils import requires_dependencies

if TYPE_CHECKING:
    from unstructured_inference.inference.elements import TextRegion
    from unstructured_inference.inference.layoutelement import (
        LayoutElement,
    )

# Added line: point unstructured_pytesseract at the Tesseract executable.
unstructured_pytesseract.pytesseract.tesseract_cmd = r"<path_to_tesseract>"  # <- add here


class OCRAgentTesseract(OCRAgent):
    ...  # rest of the class unchanged
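If you would rather not edit files inside site-packages (the change is lost whenever the package is reinstalled or upgraded), the same setting can probably be applied from your own script before calling partition_pdf. Below is a minimal sketch, not an official unstructured API: configure_tesseract is a hypothetical helper name, and the install location in the usage comment is only an assumed default. It prepends the Tesseract folder to PATH for the current process and, if unstructured_pytesseract is importable, sets the same tesseract_cmd attribute used above.

```python
import os


def configure_tesseract(tesseract_exe: str) -> None:
    """Make the Tesseract executable discoverable for the current process.

    `configure_tesseract` is a hypothetical helper, not part of unstructured.
    `tesseract_exe` is the full path to tesseract.exe on your machine.
    """
    exe_dir = os.path.dirname(tesseract_exe)

    # Prepend the install folder to PATH (this process only), avoiding duplicates.
    current = os.environ.get("PATH", "")
    parts = [p for p in current.split(os.pathsep) if p]
    if exe_dir and exe_dir not in parts:
        os.environ["PATH"] = os.pathsep.join([exe_dir, *parts])

    # Also set the module-level attribute from the answer above, if available.
    try:
        import unstructured_pytesseract
    except ImportError:
        return
    unstructured_pytesseract.pytesseract.tesseract_cmd = tesseract_exe


# Usage (the path is an assumption; adjust it to where Tesseract is installed):
# configure_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
# from unstructured.partition.pdf import partition_pdf
# elements = partition_pdf(filename="test_file.pdf", strategy="hi_res")
```

Call the helper once at the top of your script, before the first call to partition_pdf. Adding the Tesseract folder to the PATH environment variable in the Windows system settings should have the same effect without any code changes.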