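I am extracting data as JSON, which gives the result below. The data contains the coordinates of the text retrieved from an image. Is there a way to identify the table and store the data in Excel?
This data was extracted using EasyOCR from JaidedAI. I need to convert it into a suitable table format. Also, EasyOCR does not accept PDFs. Is there a way to convert a PDF to PNG?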
result = [
[[[1395, 95], [1557, 95], [1557, 137], [1395, 137]], 'Nst'],
[[[663, 197], [779, 197], [779, 239], [663, 239]], 'COPY'],
[[[200, 248], [586, 248], [586, 394], [200, 394]], 'TOM CHRIS LDSAY Ardmore Search Partners 4US'],
[[[1004, 248], [1369, 248], [1369, 485], [1004, 485]], 'COMMERCIAL CARDS DIVISION Cards Customer Services PO BOX 5000 SOUTHEND-ON-SEA SP2 9AM Telephone: 1243 673 3701 Facsimile: 1234 789 5281 Monday Friday: 08.00 18.00 Saturday: 09.00 13.00'],
[[[210, 596], [544, 596], [544, 628], [210, 628]], '06 February 05 March 2024'],
# And so on...
]
I tried the following approach with openpyxl:
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
for item in result:
    bounding_box = item[0]  # Bounding box coordinates
    text = item[1]  # Text content
    # Determine row and column based on bounding box coordinates
    # For simplicity, let's assume each bounding box represents a row
    row = bounding_box[0][1]  # Y-coordinate of the top-left corner
    column = bounding_box[0][0]  # X-coordinate of the top-left corner
    # Write text to the corresponding cell in the worksheet
    ws.cell(row=row, column=column).value = text
wb.save("output.xlsx")
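Writing the text to a cell addressed by the top-left corner (TLC) coordinates, or by any other pixel position, makes little sense: the text values end up scattered all over the sheet.
You mention you want the data in a table. Why not do that, then?
Just dump the data into a tabulated layout on the sheet.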
from openpyxl import Workbook
from openpyxl.styles import Font, Alignment
from openpyxl.utils.cell import get_column_letter as gcl
result = [
[[[1395, 95], [1557, 95], [1557, 137], [1395, 137]], 'Nst'],
[[[663, 197], [779, 197], [779, 239], [663, 239]], 'COPY'],
[[[200, 248], [586, 248], [586, 394], [200, 394]], 'TOM CHRIS LDSAY Ardmore Search Partners 4US'],
[[[1004, 248], [1369, 248], [1369, 485], [1004, 485]], 'COMMERCIAL CARDS DIVISION Cards Customer Services PO BOX 5000 SOUTHEND-ON-SEA SP2 9AM Telephone: 1243 673 3701 Facsimile: 1234 789 5281 Monday Friday: 08.00 18.00 Saturday: 09.00 13.00'],
[[[210, 596], [544, 596], [544, 628], [210, 628]], '06 February 05 March 2024'],
# And so on...
]
wb = Workbook()
ws = wb.active
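### Add some headers to the tabulated data
## Two header rows: row 1 names each corner (merged over its X/Y pair), row 2 labels the X and Y columns
## Corner order follows the sample data: top-left, top-right, bottom-right, bottom-left (image Y grows downward)
headers = [['Top Left', '', 'Top Right', '', 'Bottom Right', '', 'Bottom Left', '', 'Text'],
           ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y']
           ]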
for header in headers:  # Append the headers to the Sheet (first two rows)
    ws.append(header)
## Merge the first row common columns
merges = ['A1:B1', 'C1:D1', 'E1:F1', 'G1:H1', 'I1:I2']
for merge in merges:
    ws.merge_cells(merge)
### Set some formatting for the Headers, bold and align middle and top
for row in ws.iter_rows(max_row=2):
    for cell in row:
        cell.font = Font(bold=True)
        cell.alignment = Alignment(horizontal='center', vertical='top')
### Flatten the list of lists for each row of data then append to the Sheet
for item in result:
    flattened_list = [ele for group in item[0] for ele in group]
    flattened_list.append(item[1])
    ws.append(flattened_list)  # each row is appended below the last added row
### Set some column widths
for col in range(1, 9):
    ws.column_dimensions[gcl(col)].width = 8
### Save Excel file
wb.save("tabulated.xlsx")
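If you want actual table rows rather than one line per detection, one option is to group the boxes by vertical position before writing them out. The following is only a minimal sketch under stated assumptions, not something EasyOCR or openpyxl provides: `group_into_rows`, the `y_tolerance` default, and the `sample` data are made up for illustration, and the tolerance would need tuning to your image resolution.

```python
from statistics import mean


def group_into_rows(result, y_tolerance=20):
    """Group OCR detections into rows by vertical centre (hypothetical helper).

    Each detection is [bounding_box, text, ...], where bounding_box holds
    four [x, y] corners. Returns a list of rows, each a list of texts
    ordered left to right.
    """
    items = []
    for item in result:
        box, text = item[0], item[1]
        y_centre = mean(point[1] for point in box)
        x_left = min(point[0] for point in box)
        items.append((y_centre, x_left, text))
    items.sort()  # top to bottom, then left to right

    rows = []
    for y_centre, x_left, text in items:
        # Same row if the centre is within y_tolerance of the row's first box.
        if rows and abs(y_centre - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["cells"].append((x_left, text))
        else:
            rows.append({"y": y_centre, "cells": [(x_left, text)]})

    # Order the cells of each row by their left edge.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


# Made-up detections laid out as a 2 x 2 table (assumption, not real OCR output)
sample = [
    [[[400, 102], [520, 102], [520, 130], [400, 130]], 'Amount'],
    [[[100, 100], [220, 100], [220, 130], [100, 130]], 'Date'],
    [[[100, 150], [220, 150], [220, 180], [100, 180]], '06 Feb'],
    [[[400, 148], [520, 148], [520, 180], [400, 180]], '12.50'],
]
rows = group_into_rows(sample)
print(rows)
```

Each inner list could then be passed to `ws.append(...)` to write one worksheet row per detected line. This only recovers rows; lining cells up into consistent columns would need a similar pass over the X coordinates.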