PDF scraping - "All objects passed were None"

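I'm trying to build a simple PDF scraper with pandas and pdfquery. I want to use the XML coordinates to pull the data I need from each page of the PDF, put it into a DataFrame, and then save the DataFrame as a CSV file. I'm stuck on the last part: I can get the data from a single PDF/page, but I can't get it to work across multiple pages. I'm a relative beginner with Python, so any help is much appreciated.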

import pdfquery
import pandas as pd
pdf = pdfquery.PDFQuery(r'path')
pdf.load()
pdf.tree.write('pdfXML.txt', pretty_print = True)
def pdfscrape(pdf):
    num_1 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("378.0, 759.06, 456.0, 769.06")').text()
    num_2 = pdf.pq('LTTextBoxHorizontal:overlaps_bbox("30.0, 431.06, 360.0, 441.06")').text()
    page = pd.DataFrame({ 'num1': num_1,'num2': num_2, },index=[0])
    print(page)
pagecount = pdf.doc.catalog['Pages'].resolve()['Count']
master = pd.DataFrame()
for p in range(pagecount):
    pdf.load(p)
    page = pdfscrape(pdf)
    master = master(pd.concat([page], ignore_index=True))
    master.to_csv("output.csv", index=False)

The result I expect is a CSV file containing the required data points from every page of the PDF. Instead, I get:

Traceback (most recent call last):
    master = master(pd.concat([page], ignore_index=True))
line 380, in concat
    op = _Concatenator(
line 443, in __init__
    objs, keys = self._clean_keys_and_objs(objs, keys)
line 539, in _clean_keys_and_objs
    raise ValueError("All objects passed were None")
ValueError: All objects passed were None
python pandas dataframe export-to-csv pdf-scraping
1 Answer
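
On the error itself: in the question's code, `pdfscrape` builds `page` and prints it but never returns it, so `page = pdfscrape(pdf)` is `None`, and `pd.concat` raises the `ValueError` shown in the traceback when every object in the list is `None`. A minimal sketch that reproduces this with placeholder data instead of a real PDF (the function names and values here are hypothetical, for illustration only):

```python
import pandas as pd


def scrape_without_return():
    # Builds a DataFrame but never returns it, so the call evaluates to None.
    page = pd.DataFrame({"num1": "a", "num2": "b"}, index=[0])


def scrape_with_return():
    # Same body, but the DataFrame is handed back to the caller.
    page = pd.DataFrame({"num1": "a", "num2": "b"}, index=[0])
    return page


error_message = None
try:
    pd.concat([scrape_without_return()], ignore_index=True)
except ValueError as exc:
    # pandas refuses to concatenate a list that contains only None.
    error_message = str(exc)
    print(error_message)

combined = pd.concat([scrape_with_return()], ignore_index=True)
print(combined.shape)
```

Adding a `return page` to the scraping function should be enough to get past this particular exception.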

What you can do is load only the page you're interested in, one page at a time:

import pdfquery
import pandas as pd

def read_page(pdf):
    # Bounding box (x0, y0, x1, y1) of the text to extract, taken from the XML dump
    bbox = (56.8, 771.397, 188.992, 783.397)
    # %s keeps the decimals; %d would truncate the coordinates to integers
    text = pdf.pq('LTTextLineHorizontal:overlaps_bbox("%s, %s, %s, %s")' % bbox).text()
    print(f"From function call:  {text}\n")
    return text

pdf = pdfquery.PDFQuery('Doc_for_PDF.pdf')
pdf.load()  # load all pages to build the overview dataframe
pdf.tree.write('pdfXML.xml', pretty_print=True)

df = pd.read_xml('pdfXML.xml', xpath='.//LTTextLineHorizontal')
# print(df.to_string())
print(df.head())
print()

# Load one page at a time (zero-based index), then query the loaded page
page_count = pdf.doc.catalog['Pages'].resolve()['Count']
for i in range(page_count):
    pdf.load(i)
    read_page(pdf)

Output:

        y0       y1  ...  word_margin                        LTTextBoxHorizontal
0  771.397  783.397  ...          0.1  This Text should be scrappt on first page
1  771.397  783.397  ...          0.1    This Text should be scrappt second page
2  729.997  741.997  ...          0.1                                   This not

[3 rows x 9 columns]

From function call:  This Text should be scrappt on first page

From function call:  This Text should be scrappt second page
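
To get from the printed text to the CSV the question asks for, one option is to have the per-page function return a one-row DataFrame, collect those rows in a list, and call `pd.concat` once after the loop. Note that `master = master(...)` in the question would also fail, because a DataFrame is not callable. The following is a rough sketch under stated assumptions, not a tested solution for your file: the bounding boxes are copied from the question, and `BBOXES`, `scrape_page`, `scrape_pdf` and `main` are hypothetical names.

```python
import pandas as pd

# Bounding boxes (x0, y0, x1, y1) copied from the question; adjust to your PDF.
BBOXES = {
    "num1": (378.0, 759.06, 456.0, 769.06),
    "num2": (30.0, 431.06, 360.0, 441.06),
}


def scrape_page(pdf):
    """Return a one-row DataFrame for the page currently loaded in `pdf`."""
    row = {}
    for name, bbox in BBOXES.items():
        selector = 'LTTextBoxHorizontal:overlaps_bbox("%s, %s, %s, %s")' % bbox
        row[name] = pdf.pq(selector).text()
    return pd.DataFrame(row, index=[0])


def scrape_pdf(pdf, page_count):
    """Load each page in turn and stack the per-page rows into one DataFrame."""
    pages = []
    for p in range(page_count):
        pdf.load(p)  # zero-based page index, as in the loop above
        pages.append(scrape_page(pdf))
    return pd.concat(pages, ignore_index=True)


def main(path):
    """Scrape every page of the PDF at `path` and write output.csv."""
    import pdfquery  # imported here so the helpers above work without it

    pdf = pdfquery.PDFQuery(path)
    page_count = pdf.doc.catalog["Pages"].resolve()["Count"]
    # Write the CSV once, after all pages have been collected.
    scrape_pdf(pdf, page_count).to_csv("output.csv", index=False)
```

Calling `main` with the path to your own PDF should then produce one CSV row per page. Building the list first and concatenating once also avoids rewriting `output.csv` on every loop iteration.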