Camelot PDF extraction fails to parse

Question

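I'm having trouble with the Camelot library.

I'm extracting data from a PDF. My code works fine on the first 23 pages, but on this page it fails to parse the end of the text/table correctly.

I suspect the problem is a long string that reaches the table border.

I also tried the "stream" flavor, but the results were even worse.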

PDF source data

PDF output layout

My parsed output looks like this:

"ALT4945\n24 V"
"70\/140 A   ALT5860\n12 V\n90 A"

The desired output would be:

"ALT4945\n24 V 70\/140 A"
"ALT5860\n12 V\n90 A"

The first version of my code, which worked on the previous pages, is:

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice")

From the Camelot documentation (https://camelot-py.readthedocs.io/en/master/api.html), I got the possible configuration options for the PDF parser.

""" PARAMS for lattice
line_scale  (default: 15)
copy_text   (default: None)
shift_text  (default: ['l', 't'])
line_tol    (default: 2)
joint_tol   (default: 2)
threshold_blocksize   (default: 15)
threshold_constant    (default: -2)
iterations   (default: 0)
resolution   (default: 300)
"""

Then I ran into this problem and tried "playing" with more parameters to solve it, but did not find a winning combination:

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=3, joint_tol=3, threshold_blocksize=15)

tables = camelot.read_pdf("CROSSREFERENCE.pdf", pages=wPAGES, flavor="lattice", split_text=True, resolution=720, line_scale=250, line_tol=1, joint_tol=1, threshold_blocksize=3)
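Rather than editing the parameters by hand, the combinations can be generated and tried in a loop. The following is only a sketch: the candidate values in `GRID` and the helper names `candidate_kwargs` and `sweep` are hypothetical, and only the parameter names come from the list above. The values for `line_scale` are kept modest on the assumption that very large values (such as the 250 tried above) can cause text to be detected as lines.

```python
from itertools import product

# Hypothetical candidate values; only the parameter names come from the Camelot docs.
GRID = {
    "line_scale": [15, 40, 80],
    "line_tol": [1, 2, 3],
    "joint_tol": [1, 2, 3],
}


def candidate_kwargs(grid):
    """Yield one keyword-argument dict per combination of values in the grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def sweep(path, pages, grid=GRID):
    """Run lattice extraction for every combination and collect the parsing reports."""
    import camelot  # imported lazily so candidate_kwargs works without Camelot installed

    results = []
    for kwargs in candidate_kwargs(grid):
        tables = camelot.read_pdf(
            path, pages=pages, flavor="lattice", split_text=True, **kwargs
        )
        results.append((kwargs, [table.parsing_report for table in tables]))
    return results
```

A call such as `sweep("CROSSREFERENCE.pdf", "24")` would return one entry per combination. The parsing report may not reveal text that spilled into a neighbouring cell, so the extracted cells still need to be inspected for the combinations that look promising.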

Can I get some advice on how to avoid this?

Thanks

edit1: PDF source: https://www.siom.it/images/catalogo-motorini-alter.pdf (page 24)
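If no parameter combination fixes the extraction, one possible workaround is to repair the cells afterwards. The sketch below turns the parsed output shown in the question into the desired output. It assumes that every cell starts with a part number of the form `ALT` followed by digits, so any text before the part number is treated as overflow from the previous cell. The name `repair_overflow` and the pattern are illustrative assumptions and not part of Camelot. The sample strings use a plain `/` rather than the escaped `\/` shown in the output.

```python
import re

# Assumption: every cell begins with a part number such as "ALT4945".
PART_NUMBER = re.compile(r"ALT\d+")


def repair_overflow(cells):
    """Move text that precedes a part number back to the end of the previous cell."""
    repaired = []
    for cell in cells:
        match = PART_NUMBER.search(cell)
        if match and match.start() > 0 and repaired:
            overflow = cell[:match.start()].strip()
            if overflow:
                # Re-attach the spilled text to the cell it came from.
                repaired[-1] = f"{repaired[-1]} {overflow}"
            cell = cell[match.start():]
        repaired.append(cell)
    return repaired


cells = ["ALT4945\n24 V", "70/140 A   ALT5860\n12 V\n90 A"]
fixed = repair_overflow(cells)
# fixed == ["ALT4945\n24 V 70/140 A", "ALT5860\n12 V\n90 A"]
```

This only works while the part-number assumption holds. Cells in other columns of the table would need their own pattern.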

Answer 1

1 vote

Tested solution
