I tried the following import, but my kernel keeps dying. How can I fix this?
from unstructured.partition.pdf import partition_pdf

path = 'data/llama.pdf'
raw_pdf_elements = partition_pdf(
    filename=path,
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path='images/',  # was misspelled as image_outpur_dir_path
)
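The problem occurs on the first line (the import), but I need the raw_pdf_elements call to work. Before this I ran into some errors about the Tesseract path, so I installed the following: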
pip install tesseract
pip install tesseract-ocr
After that, my kernel started dying. Exit log:
> 00:01:01.922 [error] Disposing session as kernel process died
> ExitCode: undefined, Reason: 00:01:01.922 [info] Dispose Kernel
> process 35807. 00:01:01.945 [info] End cell 98 execution after
> -1709672459.206s, completed @ undefined, started @ 1709672459206
Is this from the LangChain cookbook (Semi-structured RAG)?
I used this on Google Colab:
!sudo apt install tesseract-ocr
!pip install pytesseract
%pip install pdf2image
!apt-get install poppler-utils
%reset
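Before re-running the import, it may help to confirm that the system binaries those commands install are actually visible to the kernel. The sketch below is only a rough check, not something from the cookbook: the helper name `missing_binaries` is made up for illustration, and it assumes that `tesseract` (from the tesseract-ocr package) and `pdftoppm` (from poppler-utils) are the executables that matter here.

```python
import shutil


def missing_binaries(names):
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names: `tesseract` ships with tesseract-ocr,
    # `pdftoppm` ships with poppler-utils.
    missing = missing_binaries(["tesseract", "pdftoppm"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both on PATH")
```

If either name is reported as missing, the install did not land on the PATH the notebook kernel sees. A found binary does not by itself rule out other causes of the kernel dying.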