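I want to build an HTML table from a table loaded from a Word document, but merged cells are a big problem. The best I have managed so far is to return the cell values without repeating the merged cells, but I am stuck there and don't know how to continue.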
from docx import Document

def iter_unique_cells(row):
    prior_tc = None
    for cell in row.cells:
        this_tc = cell._tc
        if this_tc is prior_tc:
            continue
        prior_tc = this_tc
        yield cell

document = Document("document.docx")
for table in document.tables:
    for row in table.rows:
        for cell in iter_unique_cells(row):
            for paragraph in cell.paragraphs:
                print(paragraph.text)
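I would rewrite the iter_unique_cells function so that it also returns whether the current cell is merged. You can then integrate that information into the HTML by adding colspan="2" to the <td></td> element, which should merge the cells horizontally. To build the HTML, I would declare a string outside all the loops, then append each element's opening tag at the start of each iteration and its closing tag at the end.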
from docx import Document

def iter_unique_cells(row):
    ...  # modify to return cell, is_merged

document = Document("document.docx")
html = ""
for table in document.tables:
    html += "<table>"
    for row in table.rows:
        html += "<tr>"
        for cell, is_merged in iter_unique_cells(row):
            html += "<td colspan='2'>" if is_merged else "<td>"
            for paragraph in cell.paragraphs:
                html += f"<p>{paragraph.text}</p>"
            html += "</td>"
        html += "</tr>"
    html += "</table>"
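One possible way to fill in the elided function is sketched below. It is only a sketch, and it rests on the same assumption the question's code makes: that row.cells repeats the same underlying _tc element once for every grid column a merged cell covers. Instead of a True/False flag it counts those repeats, so a cell spanning three columns gets colspan="3" rather than a fixed 2. The names iter_cells_with_span, rows_to_html and docx_tables_to_html are hypothetical helpers made up for this example, and vertical merges (rowspan) are not handled.

```python
from html import escape


def iter_cells_with_span(row):
    """Yield (cell, colspan) for each distinct cell in a row.

    Assumption: a horizontally merged cell appears in row.cells once per
    grid column it covers, each time with the same ``_tc`` element.
    """
    current = None
    span = 0
    for cell in row.cells:
        if current is not None and cell._tc is current._tc:
            span += 1  # same underlying cell again: widen the span
            continue
        if current is not None:
            yield current, span
        current, span = cell, 1
    if current is not None:
        yield current, span


def rows_to_html(rows):
    """Build one <table> from an iterable of row objects."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell, span in iter_cells_with_span(row):
            parts.append(f'<td colspan="{span}">' if span > 1 else "<td>")
            for paragraph in cell.paragraphs:
                # Escape the text so characters like & and < stay valid HTML.
                parts.append(f"<p>{escape(paragraph.text)}</p>")
            parts.append("</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


def docx_tables_to_html(path):
    """Convert every table in a .docx file to HTML (requires python-docx)."""
    from docx import Document  # imported here so the helpers above work without it

    document = Document(path)
    return "".join(rows_to_html(table.rows) for table in document.tables)
```

With python-docx installed, docx_tables_to_html("document.docx") should return the HTML for all tables in the file as a single string. Collecting the pieces in a list and joining once at the end is a small design choice that avoids repeated string concatenation; the html += approach above works just as well for small documents.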