How to remove borders from PDF using Python and pdfplumber for Azure Form Recognizer?

Question

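I am currently working on a project that uses Azure Form Recognizer to extract information from PDF files. I have extracted the text successfully, but I am running into a problem with tables: because each page has a border, the entire page is treated as a single table.

To get around this, I am trying to remove the borders from the PDF before sending it to Azure Form Recognizer. I am using the pdfplumber library in Python to extract the rectangles (borders), but I cannot find a way to modify the PDF and delete them.

I have attached an image of the PDF page and the code snippet I am currently using. I have also attached an image showing the rectangle with the greatest height that I extracted.

I would appreciate any help or suggestions on how to remove borders from a PDF with Python, or any alternative ideas for achieving the desired result.

Code snippet: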

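import pdfplumber
import pandas as pd

# Open the PDF and select the page to inspect (index 5 is the sixth page)
reader = pdfplumber.open('file.pdf')
pag = reader.pages[5]

# Collect the edges of every rectangle on the page, keeping only those with non-zero height
df1 = pd.DataFrame(pag.rect_edges)
df1 = df1[df1['height'] != 0.0]
print(df1)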

Image of the PDF page:

Image of the rectangle coordinates:

Answer 1

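You are on the right track with pdfplumber, but modifying a PDF to remove borders is not straightforward with pdfplumber alone. You could give PyPDF2 a try.

Here is a code snippet to help you get started: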

import PyPDF2
import pandas as pd

# Open the PDF file
with open('file.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    page = reader.pages[5]
    page_data = page.extract_text()

    # Remove the borders from the extracted text
    border_characters = ['|', '-', '+']  # Adjust this list based on the border characters present in your PDF
    for char in border_characters:
        page_data = page_data.replace(char, '')

    # Process the modified text data as desired (e.g., extracting tables)
    # ...

    data = page_data.split('\n')
    df = pd.DataFrame([row.split() for row in data])

    print(df)

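In this code, we use PyPDF2 to extract the text from the PDF page. We then use the replace() function to remove the specified border characters (such as '|', '-', and '+') from the extracted text. Remember to adjust the border_characters list to match the border characters in your PDF.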
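The same cleanup can also be sketched as a small standalone function, which makes it easy to try on a sample string before running it against a real page. The helper name strip_border_chars and the sample text are made up for illustration. Note that deleting '-' also removes hyphens and minus signs inside cell values, so the character list may need trimming for your data.

```python
def strip_border_chars(text, border_chars=("|", "-", "+")):
    """Remove border characters, then split the remaining text into rows of cells."""
    # str.translate deletes every listed single character in one pass
    delete_table = str.maketrans("", "", "".join(border_chars))
    cleaned = text.translate(delete_table)

    rows = []
    for line in cleaned.split("\n"):
        cells = line.split()
        if cells:  # skip lines that contained only border characters
            rows.append(cells)
    return rows


# Hypothetical sample of text extracted from a page with an ASCII-style table
sample = (
    "+------+-----+\n"
    "| Item | Qty |\n"
    "+------+-----+\n"
    "| Bolt | 12  |\n"
    "+------+-----+"
)
rows = strip_border_chars(sample)
print(rows)
```

The resulting list of rows can be passed straight to pd.DataFrame(rows) if a DataFrame is wanted.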

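Feel free to process the modified text data however you need, for example by extracting tables or converting it to a DataFrame.

Keep in mind that the success of this approach depends on how consistent the border characters in the PDF are. If the borders are variable or complex, you may need additional preprocessing steps or more advanced techniques.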
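One possible preprocessing step, sketched here under assumptions, is to separate the page border from genuine table rectangles by size. pdfplumber exposes the rectangles on a page as dictionaries in page.rects, with keys such as x0, x1, top, bottom, width and height. The helper names and the 0.8 coverage threshold below are hypothetical choices, not part of either library. The sketch only locates the likely border; it does not rewrite the PDF.

```python
def find_page_border(rects, page_width, page_height, min_coverage=0.8):
    """Return the rectangle most likely to be the page border, or None.

    A rectangle counts as a border candidate when it spans at least
    `min_coverage` of both the page width and the page height
    (an assumed heuristic; tune it for your documents).
    """
    candidates = [
        r for r in rects
        if float(r["width"]) >= min_coverage * float(page_width)
        and float(r["height"]) >= min_coverage * float(page_height)
    ]
    if not candidates:
        return None
    # Prefer the candidate with the largest area
    return max(candidates, key=lambda r: float(r["width"]) * float(r["height"]))


def border_for_pdf_page(path, page_index):
    """Open a PDF with pdfplumber and look for a border on one page."""
    import pdfplumber  # imported here so find_page_border works without it

    with pdfplumber.open(path) as pdf:
        page = pdf.pages[page_index]
        return find_page_border(page.rects, page.width, page.height)
```

Calling border_for_pdf_page('file.pdf', 5) would inspect the same page as in the question. The coordinates of the returned rectangle could then guide whichever tool is used to crop or redact the border before the file goes to Azure Form Recognizer.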
