How to extract text data from an image and build it into an Excel table

Question (votes: 0, answers: 1)

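I am extracting data as JSON, which gives the result below. The data contains the text retrieved from an image together with the coordinates of each text block. Is there a way to identify the table and store the data in Excel?

This data was extracted with JaidedAI's EasyOCR. I need to convert it into a suitable tabular format. Also, EasyOCR does not accept PDF input. Is there a way to convert a PDF to PNG?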

result = [
    [[[1395, 95], [1557, 95], [1557, 137], [1395, 137]], 'Nst'],
    [[[663, 197], [779, 197], [779, 239], [663, 239]], 'COPY'],
    [[[200, 248], [586, 248], [586, 394], [200, 394]], 'TOM CHRIS LDSAY Ardmore Search Partners 4US'],
    [[[1004, 248], [1369, 248], [1369, 485], [1004, 485]], 'COMMERCIAL CARDS DIVISION Cards Customer Services PO BOX 5000 SOUTHEND-ON-SEA SP2 9AM Telephone: 1243 673 3701 Facsimile: 1234 789 5281 Monday Friday: 08.00 18.00 Saturday: 09.00 13.00'],
    [[[210, 596], [544, 596], [544, 628], [210, 628]], '06 February 05 March 2024'],
    # And so on...
]

I tried the following approach with openpyxl:

from openpyxl import Workbook
wb = Workbook()
ws = wb.active

for item in result:
    bounding_box = item[0]  # Bounding box coordinates
    text = item[1]  # Text content

    # Determine row and column based on bounding box coordinates
    # For simplicity, let's assume each bounding box represents a row
    row = bounding_box[0][1]  # Y-coordinate of the top-left corner
    column = bounding_box[0][0]  # X-coordinate of the top-left corner

    # Write text to the corresponding cell in the worksheet
    ws.cell(row=row, column=column).value = text
wb.save("output.xlsx")

Tags: python, excel, coordinates, ocr

1 Answer (votes: 0)

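Writing each text to the cell whose row and column numbers are the pixel coordinates of the box's top-left corner (TLC), or of any other corner, does not make sense: the text values end up scattered across the sheet. The first item, for example, would land in row 95, column 1395.

You mention that you want the data in a table, so why not do exactly that?
It can be done by simply dumping the data onto the sheet in a tabular layout, one row per OCR item.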

from openpyxl import Workbook
from openpyxl.styles import Font, Alignment
from openpyxl.utils.cell import get_column_letter as gcl

result = [
    [[[1395, 95], [1557, 95], [1557, 137], [1395, 137]], 'Nst'],
    [[[663, 197], [779, 197], [779, 239], [663, 239]], 'COPY'],
    [[[200, 248], [586, 248], [586, 394], [200, 394]], 'TOM CHRIS LDSAY Ardmore Search Partners 4US'],
    [[[1004, 248], [1369, 248], [1369, 485], [1004, 485]], 'COMMERCIAL CARDS DIVISION Cards Customer Services PO BOX 5000 SOUTHEND-ON-SEA SP2 9AM Telephone: 1243 673 3701 Facsimile: 1234 789 5281 Monday Friday: 08.00 18.00 Saturday: 09.00 13.00'],
    [[[210, 596], [544, 596], [544, 628], [210, 628]], '06 February 05 March 2024'],
    # And so on...
]


wb = Workbook()
ws = wb.active

### Add some headers to the tabulated data
## Two header rows: corner names in row 1 (merged below), X/Y labels in row 2
## Corner order follows the sample data: top-left, top-right, bottom-right, bottom-left
headers = [['Top Left', '', 'Top Right', '', 'Bottom Right', '', 'Bottom Left', '', 'Text'],
           ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y']
]
for header in headers:  # Append the headers to the Sheet (first row)
    ws.append(header)

## Merge the first row common columns
merges = ['A1:B1', 'C1:D1', 'E1:F1', 'G1:H1', 'I1:I2']
for merge in merges:
    ws.merge_cells(merge)

### Set some formatting for the Headers, bold and align middle and top
for row in ws.iter_rows(max_row=2):
    for cell in row:
        cell.font = Font(bold=True)
        cell.alignment = Alignment(horizontal='center', vertical='top')

### Flatten the list of lists for each row of data then append to the Sheet
for item in result:
    flattened_list = [ele for group in item[0] for ele in group]
    flattened_list.append(item[1])
    ws.append(flattened_list)  # each row is appended below the last added row

### Set some column widths
for col in range(1, 9):
    ws.column_dimensions[gcl(col)].width = 8

### Save Excel file
wb.save("tabulated.xlsx")

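I am not sure the corner names in the headers are correct, so change them as needed. Judging by the sample boxes, the points appear to be ordered top-left, top-right, bottom-right, bottom-left, with y increasing downward.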
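
If the goal is to rebuild the layout of the page rather than list every box, one possible approach is to group boxes whose top edges are close together into the same row and order each row from left to right. The sketch below is only a heuristic under stated assumptions: `group_into_rows` and the `y_tolerance` threshold are hypothetical names and values, not part of EasyOCR or openpyxl, and the sample texts are shortened from the question's data. It orders cells within a row but does not align them to fixed column positions.

```python
def group_into_rows(result, y_tolerance=15):
    """Group OCR items into rows, top to bottom, each row left to right.

    An item joins the current row when its top edge is within
    y_tolerance pixels of the top edge of the row's first item.
    The threshold is an assumption and needs tuning per document.
    """
    items = []
    for item in result:
        box, text = item[0], item[1]
        left = min(point[0] for point in box)
        top = min(point[1] for point in box)
        items.append((top, left, text))
    items.sort()  # top to bottom, then left to right

    rows = []
    for top, left, text in items:
        if rows and top - rows[-1][0][0] <= y_tolerance:
            rows[-1].append((top, left, text))
        else:
            rows.append([(top, left, text)])

    # Keep only the text, ordered by the left edge within each row
    return [[text for _, _, text in sorted(row, key=lambda i: i[1])]
            for row in rows]


# Shortened sample based on the data in the question
sample = [
    [[[1395, 95], [1557, 95], [1557, 137], [1395, 137]], 'Nst'],
    [[[663, 197], [779, 197], [779, 239], [663, 239]], 'COPY'],
    [[[1004, 248], [1369, 248], [1369, 485], [1004, 485]], 'COMMERCIAL CARDS DIVISION'],
    [[[200, 248], [586, 248], [586, 394], [200, 394]], 'TOM CHRIS LDSAY'],
    [[[210, 596], [544, 596], [544, 628], [210, 628]], '06 February 05 March 2024'],
]
rows = group_into_rows(sample)
for row in rows:
    print(row)
```

Each resulting list can then be written to the sheet with `ws.append(row)`, in the same way as the flattened lists above. Real tables usually need an extra step that assigns each box to a column based on its x position.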

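
On the PDF part of the question: one option is to render each page to a PNG first and run OCR on the images. The sketch below assumes the PyMuPDF package (imported as `fitz`) is installed; the function names, the output file naming, and the 200 dpi default are illustrative choices, not something taken from the question or from EasyOCR.

```python
from pathlib import Path


def page_png_path(pdf_path, page_number, out_dir="."):
    """Build an output path such as 'out/statement_page1.png' (naming is an assumption)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.png"


def pdf_to_pngs(pdf_path, out_dir=".", dpi=200):
    """Render every page of a PDF to a PNG file and return the written paths."""
    import fitz  # PyMuPDF; imported here so page_png_path works without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            target = page_png_path(pdf_path, number, out_dir)
            pix.save(str(target))
            paths.append(target)
    return paths
```

The returned paths can then be passed one at a time to EasyOCR, for example `reader.readtext(str(path))` on an `easyocr.Reader` instance. A higher `dpi` generally helps OCR on small print at the cost of larger images.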