AWS Lambda subprocess: ModuleNotFoundError


Good afternoon,

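I have been setting up some code that uses the fitz library (PyMuPDF) to extract text. The module is installed correctly through a Lambda layer and works as expected, but when I try to run the official fitz utils script, I get: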

ModuleNotFoundError: No module named 'fitz'

Example code:

def extract_text(pdf_stream):
    try:
        pdf_doc = fitz.open(stream=pdf_stream, filetype='pdf')
        
        # Save the PDF to /tmp, the only writable location in Lambda
        # (the rest of the filesystem is read-only)
        pdf_doc.save('/tmp/file.pdf')
        
        logger.info("PDF file saved. Running fitzcli.py.")
        cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
        subprocess.run(cmd_args, check=True)

        with open('extracted_text.txt', 'r') as open_file:
            read_file = open_file.read()
        # Assuming extract_top_rows function is defined elsewhere in your code
        headers_text = extract_top_rows(read_file)
        
        return headers_text

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        raise

Link to the script: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/fitzcli.py

Because of licensing restrictions, I cannot modify the script.

I tried copying the Lambda execution environment and running the subprocess with that copy:

env = os.environ.copy()
cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
subprocess.run(cmd_args, check=True, env=env)

I expected the subprocess to run.

python aws-lambda operating-system subprocess command-line-interface
1 answer

@simpleApp fitz imports fine when this part of the code runs:

pdf_doc = fitz.open(stream=pdf_stream, filetype='pdf')
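
One way to see where the two interpreters differ is to compare sys.path in the Lambda process with sys.path in a freshly started child process. The snippet below is only a diagnostic sketch and not part of the original code: the helper names child_sys_path and missing_in_child are made up for illustration, and it assumes sys.executable is the interpreter the function itself is running under.

```python
import json
import subprocess
import sys


def child_sys_path(env=None):
    """Return sys.path as seen by a fresh interpreter started via subprocess.

    env=None means the child inherits the current environment unchanged.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        check=True,
        capture_output=True,
        text=True,
        env=env,
    )
    # Parse the last line in case anything else was written to stdout.
    return json.loads(result.stdout.strip().splitlines()[-1])


def missing_in_child(env=None):
    """List entries on the parent's sys.path that the child does not have."""
    child = set(child_sys_path(env))
    return [p for p in sys.path if p and p not in child]


if __name__ == "__main__":
    for path in missing_in_child():
        print("only in parent:", path)
```

If the layer's site-packages directory shows up in that list, the child process has no way to find fitz, even though the parent can import it.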

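Apparently PYTHONPATH is not set correctly only when the subprocess is used. The fix is to point it at the module path and pass a copy of the Lambda execution environment to the subprocess: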

env = os.environ.copy()
env['PYTHONPATH'] = '/opt/python/lib/python3.12/site-packages'
cmd_args = ["python", "fitzcli.py", "gettext", "file.pdf", "-output", "/tmp/extracted_text.txt", "-mode", "layout"]
subprocess.run(cmd_args, check=True, env=env)
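
A slightly more general variant of the same idea avoids hard-coding the Python version in the path: copy the parent's own sys.path into PYTHONPATH and start the child with sys.executable instead of whatever "python" resolves to. This is a sketch under assumptions, not the code I actually ran: build_child_env and run_script are hypothetical helper names, and it assumes everything the script needs is importable in the parent process.

```python
import os
import subprocess
import sys


def build_child_env(extra_paths=None):
    """Copy the current environment and expose the parent's import paths.

    extra_paths (optional) are placed first, e.g. a layer directory.
    """
    env = os.environ.copy()
    # Skip the empty entry that stands for "current directory".
    paths = [p for p in sys.path if p]
    if extra_paths:
        paths = list(extra_paths) + paths
    existing = env.get("PYTHONPATH")
    if existing:
        paths.append(existing)
    env["PYTHONPATH"] = os.pathsep.join(paths)
    return env


def run_script(script_args, env=None, cwd=None):
    """Run a script with the same interpreter as the parent process."""
    if env is None:
        env = build_child_env()
    return subprocess.run(
        [sys.executable, *script_args],
        check=True,
        capture_output=True,
        text=True,
        env=env,
        cwd=cwd,
    )
```

With the PDF saved to /tmp/file.pdf as in the question, the call would look like run_script(["fitzcli.py", "gettext", "/tmp/file.pdf", "-output", "/tmp/extracted_text.txt", "-mode", "layout"]). Absolute paths under /tmp are used here because the rest of the Lambda filesystem is read-only.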