I have been using pdfplumber since then. Are there any other libraries, apart from camelot? It uses PyPDF2 and now fails with this error:
File "C:\Users\USER\AppData\Local\Programs\Python\Python312\Lib\site-packages\PyPDF2\_utils.py", line 369, in deprecation_with_replacement
deprecation(DEPR_MSG_HAPPENED.format(old_name, removed_in, new_name))
File "C:\Users\USER\AppData\Local\Programs\Python\Python312\Lib\site-packages\PyPDF2\_utils.py", line 351, in deprecation
raise DeprecationError(msg)
PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.
Are there any other approaches? Thanks!
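For what it's worth, the traceback says that `PdfFileReader` was removed in PyPDF2 3.0.0, so whether the error appears depends on which PyPDF2 version is installed. Below is a minimal sketch for checking that from Python. The helper names `has_legacy_reader` and `check_pypdf2` are made up for illustration, and the version parsing assumes a plain `major.minor.patch` string.

```python
from importlib import metadata


def has_legacy_reader(version):
    """Return True if this PyPDF2 version predates 3.0.0 (PdfFileReader not yet removed)."""
    major = int(version.split(".")[0])
    return major < 3


def check_pypdf2():
    """Describe whether the installed PyPDF2 would raise the DeprecationError."""
    try:
        version = metadata.version("PyPDF2")
    except metadata.PackageNotFoundError:
        return "PyPDF2 is not installed"
    if has_legacy_reader(version):
        return f"PyPDF2 {version}: PdfFileReader has not been removed yet"
    return f"PyPDF2 {version}: PdfFileReader was removed, callers must use PdfReader"


print(check_pypdf2())
```

If it reports 3.0.0 or later, one possible workaround (assuming the installed camelot still calls `PdfFileReader`) is to pin PyPDF2 below 3.0.0, for example with `pip install "PyPDF2<3.0"`.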
I tried pdfplumber, which is my current working solution, but it does not read some PDF tables well.
I mainly use tabula-py: https://pypi.org/project/tabula-py/
It is a wrapper around the bundled Java jar file from https://github.com/tabulapdf/tabula
>>> import tabula
>>> f = "https://annex.exploratorium.edu/ronh/solar_system/scale.pdf"
>>> df = tabula.read_pdf(f, stream=True)
'pages' argument isn't specified.Will extract only from page 1 by default.
>>> df[0].head()
Planet Distance Unnamed: 0 Distance to Scale distance Actual
0 NaN from Sun NaN planet from Sun diameter
1 NaN (AU) NaN (kilometers) (centimeters) (kilometers)
2 Sun (a star) 0 NaN NaN NaN 1,391,980
3 Mercury 0.39 NaN 58,000,000 NaN 4,880
4 Venus 0.72 NaN 108,000,000 NaN 12,100
It returns a list of DataFrames, and you can also use local files.
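As the output above shows, the column header can end up spread over the first few data rows. Here is a rough sketch of how one might read every page of a local file and fold those rows back into the column names. `merge_header_rows` and `read_tables` are hypothetical helpers, not part of tabula-py, and `header_rows=2` is only a guess that suits this particular PDF. The `tabula.read_pdf` call needs tabula-py and a Java runtime.

```python
import math


def merge_header_rows(columns, rows):
    """Join each column name with the header fragments found in `rows`.

    `columns` is the list of column names and `rows` the leading data rows
    that really belong to the header. Missing cells (None/NaN) and the
    auto-generated "Unnamed: N" names are skipped.
    """
    merged = []
    for i, name in enumerate(columns):
        parts = [] if str(name).startswith("Unnamed:") else [str(name)]
        for row in rows:
            cell = row[i]
            if cell is None or (isinstance(cell, float) and math.isnan(cell)):
                continue
            parts.append(str(cell))
        merged.append(" ".join(parts))
    return merged


def read_tables(path, header_rows=2):
    """Read all pages of a local PDF and flatten multi-row headers."""
    import tabula  # imported lazily: requires tabula-py and Java

    tables = tabula.read_pdf(path, pages="all", stream=True)
    cleaned = []
    for df in tables:
        head = df.head(header_rows).values.tolist()
        body = df.iloc[header_rows:].reset_index(drop=True)
        body.columns = merge_header_rows(list(df.columns), head)
        cleaned.append(body)
    return cleaned
```

Something like `read_tables("scale.pdf")` would then give one cleaned DataFrame per detected table. Passing `pages="all"` also avoids the "'pages' argument isn't specified" warning.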