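Good afternoon,
I have been setting up some code that uses the fitz library (PyMuPDF) to extract text. The module is installed correctly through a Lambda layer and works as expected, but when I try to run the official fitz utils script, I get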
ModuleNotFoundError: No module named 'fitz'
Sample code:
import logging
import subprocess

import fitz  # PyMuPDF

logger = logging.getLogger(__name__)


def extract_text(pdf_stream):
    try:
        pdf_doc = fitz.open(stream=pdf_stream, filetype='pdf')
        # Save the PDF document to a file.
        # /tmp is the only writable location in Lambda; everything else is read-only.
        pdf_doc.save('/tmp/file.pdf')
        logger.info("PDF file saved. Running fitzcli.py.")
        cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
        subprocess.run(cmd_args, check=True)
        with open('extracted_text.txt', 'r') as open_file:
            read_file = open_file.read()
        # extract_top_rows is assumed to be defined elsewhere in the code
        headers_text = extract_top_rows(read_file)
        return headers_text
    except Exception as e:
        logger.error(f"An error occurred: {e}")
        raise
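Link to the script: https://github.com/pymupdf/PyMuPDF-Utilities/blob/master/text-extraction/fitzcli.py
Due to licensing restrictions, I cannot modify that code.
I tried copying the Lambda execution environment and running the subprocess with that environment.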
env = os.environ.copy()
cmd_args = ["python", "fitzcli.py", "gettext", "-input", "file.pdf", "-output", "tmp/extracted_text.txt", "-mode", "layout"]
subprocess.run(cmd_args, check=True, env=env)
Expected result: the subprocess runs.
@simpleApp fitz imports fine when this part of the code runs:
pdf_doc = fitz.open(stream=pdf_stream, filetype='pdf')
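One way to see why the import works in the handler but not in the child process is to compare the module search path of the two interpreters. The sketch below is only a diagnostic aid, not part of the original code; the helper names child_sys_path and missing_in_child are hypothetical. It assumes the difference comes from entries that were added to the parent's sys.path at runtime and are therefore not inherited by a new interpreter.

```python
import json
import subprocess
import sys


def child_sys_path(env=None):
    """Return sys.path as seen by a freshly started interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
        check=True,
        capture_output=True,
        text=True,
        env=env,  # None means the child inherits the current environment
    )
    return json.loads(result.stdout)


def missing_in_child(env=None):
    """List entries on the parent's sys.path that the child does not have."""
    child = set(child_sys_path(env))
    return [p for p in sys.path if p and p not in child]
```

Logging the result of missing_in_child() from inside the handler should show whether the layer's site-packages directory is one of the entries the subprocess lacks.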
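Apparently, PYTHONPATH is not set correctly only when a subprocess is used. The fix is to point it at the module path and pass a copy of the Lambda execution environment to the subprocess: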
env = os.environ.copy()
env['PYTHONPATH'] = '/opt/python/lib/python3.12/site-packages'
cmd_args = ["python", "fitzcli.py", "gettext", "file.pdf", "-output", "/tmp/extracted_text.txt", "-mode", "layout"]
subprocess.run(cmd_args, check=True, env=env)
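The same fix can be wrapped in a small helper so that any existing PYTHONPATH value is kept rather than overwritten. This is a minimal sketch under a few assumptions: the helper names env_with_pythonpath and run_python_script are hypothetical, the layer path is the one quoted above and may differ for other runtimes, and sys.executable is used in place of "python" so the child runs under the same interpreter as the handler.

```python
import os
import subprocess
import sys

# Assumed layer location, taken from the fix above; adjust to the runtime in use.
LAYER_SITE_PACKAGES = "/opt/python/lib/python3.12/site-packages"


def env_with_pythonpath(extra_path, base_env=None):
    """Return a copy of base_env with extra_path prepended to PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH", "")
    # Skip empty parts so no stray separator ends up in the value.
    env["PYTHONPATH"] = os.pathsep.join(p for p in (extra_path, existing) if p)
    return env


def run_python_script(args, extra_path=LAYER_SITE_PACKAGES):
    """Run a script with the current interpreter and the extended PYTHONPATH."""
    return subprocess.run(
        [sys.executable, *args],
        check=True,
        capture_output=True,  # keep stdout/stderr available for logging
        text=True,
        env=env_with_pythonpath(extra_path),
    )
```

With this in place, the call above would become something like run_python_script(["fitzcli.py", "gettext", "file.pdf", "-output", "/tmp/extracted_text.txt", "-mode", "layout"]), and result.stderr can be logged if the script fails.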