When I access a paragraph's text, it does not include the list numbering.
Current code:
from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)
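The list in the docx file:
What I expect:
(1) Naturalization of both...
(2) Naturalization...
(3) Naturalization...
What I get:
Naturalization of both...
Naturalization...
Naturalization...
After inspecting the document's XML, I found that the list numbers are stored in w:abstractNum, but I don't know how to access them or how to connect them to the correct list items. How can I access the number of each list item in python-docx so that I can include it in my output? Is there a way to determine the correct nesting of these lists with python-docx?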
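According to [ReadThedocs.Python-DocX]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet. The alternative (at least one of them),
[PyPI]: docx2python, handles these elements rather poorly (mainly because it returns everything converted to strings).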
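One solution would be to parse the XML files manually (how to do this was discovered empirically, working on this very example). A good place for documentation is Office Open XML (I don't know whether it is a standard followed by all tools that handle .docx files, MS Word in particular):
- For each paragraph, check whether it is a numbered item (it has a w:pPr -> w:numPr subnode)
- Get the numbering Id and level: the w:val attribute of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
- Match the two values against the definitions in word/numbering.xml:
- Get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the value of the aforementioned w:numFmt attribute)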
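The steps above can be sketched with the standard library alone. This is a minimal illustration, not python-docx functionality: the names `val`, `fmt`, `read_levels`, `label_paragraphs` and `read_docx` are hypothetical helpers, and the sketch assumes numbering is declared directly on each paragraph (no style-based numbering, no w:lvlOverride), keeps one counter set per w:numId, and renders only decimal, letter, roman and bullet formats.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ROMAN = ((1000, "m"), (900, "cm"), (500, "d"), (400, "cd"), (100, "c"), (90, "xc"),
         (50, "l"), (40, "xl"), (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"))


def val(node, path):
    """w:val of the first node matching path below node, or None."""
    child = node.find(path, NS) if node is not None else None
    return None if child is None else child.get(f"{{{W}}}val")


def fmt(n, num_fmt):
    """Render counter n for a few common w:numFmt values (others fall back to decimal)."""
    if num_fmt in ("lowerLetter", "upperLetter"):
        text = chr(ord("a") + (n - 1) % 26)  # simplification: wraps around after z
    elif num_fmt in ("lowerRoman", "upperRoman"):
        text = ""
        for value, symbol in ROMAN:
            while n >= value:
                text, n = text + symbol, n - value
    else:
        text = str(n)
    return text.upper() if num_fmt.startswith("upper") else text


def read_levels(numbering):
    """Map numId -> {ilvl: (numFmt, lvlText, start)} from the root of word/numbering.xml."""
    abstract = {}
    for node in numbering.findall("w:abstractNum", NS):
        abstract[node.get(f"{{{W}}}abstractNumId")] = {
            int(lvl.get(f"{{{W}}}ilvl")): (val(lvl, "w:numFmt") or "decimal",
                                           val(lvl, "w:lvlText") or "",
                                           int(val(lvl, "w:start") or 1))
            for lvl in node.findall("w:lvl", NS)}
    # Each w:num points at its definition through a w:abstractNumId child.
    return {num.get(f"{{{W}}}numId"): abstract.get(val(num, "w:abstractNumId"), {})
            for num in numbering.findall("w:num", NS)}


def label_paragraphs(document, numbering):
    """Yield (label, text) per w:p; label is '' for paragraphs outside any list."""
    nums = read_levels(numbering)
    counters = {}  # numId -> {ilvl: current count}
    for p in document.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_pr = p.find("w:pPr/w:numPr", NS)
        num_id, ilvl = val(num_pr, "w:numId"), int(val(num_pr, "w:ilvl") or 0)
        levels = nums.get(num_id, {})
        if ilvl not in levels:
            yield "", text
            continue
        state = counters.setdefault(num_id, {})
        state[ilvl] = state.get(ilvl, levels[ilvl][2] - 1) + 1
        for deeper in [k for k in state if k > ilvl]:
            del state[deeper]  # nested levels restart under a new parent item
        num_fmt, lvl_text, _ = levels[ilvl]
        if num_fmt == "bullet":
            yield lvl_text, text
            continue

        def expand(match):
            k = int(match.group(1)) - 1  # %1 refers to level 0, %2 to level 1, ...
            k_fmt, _, k_start = levels.get(k, ("decimal", "", 1))
            return fmt(state.get(k, k_start), k_fmt)

        yield re.sub(r"%(\d)", expand, lvl_text), text


def read_docx(path):
    """Open a .docx and return [(label, text), ...] for its paragraphs."""
    with zipfile.ZipFile(path) as archive:
        document = ET.fromstring(archive.read("word/document.xml"))
        if "word/numbering.xml" in archive.namelist():
            numbering = ET.fromstring(archive.read("word/numbering.xml"))
        else:
            numbering = ET.Element(f"{{{W}}}numbering")
    return list(label_paragraphs(document, numbering))
```

Calling `read_docx("C:/Foo.docx")` and printing `label + " " + text` for each pair should give output close to what the question asks for, and the `ilvl` value read for each paragraph is what determines the nesting. Documents that rely on features the sketch skips may number differently than Word displays them.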
However, that seems extremely complex, so I'm proposing a workaround (gainarie) that makes use of docx2python's partial support.
Test document (sample.docx - created with LibreOffice):
code00.py:
#!/usr/bin/env python

import sys
import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]
    level = int(desc.attrib[ns_tag_name(desc, "val")])
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return " " * level + elem + " "


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32
Doc title
doc subtitle
heading1 text0
Paragr0 line0
Paragr0 line1
Paragr0 line2
space Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
a) w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)
1) paragrx 1x (numbered)
a) paragrx 1ax (numbered)
I) paragrx 1aIx (numbered)
b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)
-- paragrx bullet 0
-- paragrx bullet 00
paragxx text
Done.
Notes:
Using docx2python:
from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)
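Printing `document.body` shows one deeply nested list. A small helper can walk it so that each paragraph is printed on its own line. This is only a sketch: `iter_paragraphs` is a hypothetical helper, and the `body` value below is hand-written stand-in data that assumes docx2python returns nested lists of strings with the list label already included as a text prefix.

```python
def iter_paragraphs(nested):
    """Yield every string found in an arbitrarily nested list, in document order."""
    for item in nested:
        if isinstance(item, str):
            yield item
        else:
            yield from iter_paragraphs(item)


# Stand-in for document.body; the real value would come from docx2python.
body = [[[["1)\tfirst item", "\ta)\tnested item"], ["2)\tsecond item"]]]]

for paragraph in iter_paragraphs(body):
    if paragraph:  # skip empty paragraphs
        print(paragraph)
```

Because the helper does not depend on a fixed nesting depth, it should keep working if the structure differs between docx2python versions.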
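The list numbers can be converted to plain text in Word before the document is read with python-docx, so there is no need to handle them yourself. Open the document in Word, open the Visual Basic editor (Alt+F11), open the Immediate window (Ctrl+G), type the following macro and press Enter:
ActiveDocument.Range.ListFormat.ConvertNumbersToText
At this point, you can save the document and run it through python-docx.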
https://github.com/nguyendangson/extract_number_heading_python-docx
Input: the path of your docx file. Output: a list of heading numbers and heading names.
from docx import Document


def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names prefixed with their section numbers (chapters, sections, subsections, ...)
    '''
    doc = Document(doc_path)
    # Extract the level and the text of every heading
    heading_numbers = []
    heading_name = []
    for paragraph in doc.paragraphs:
        # Check if the paragraph is a heading: 'Heading 1' for chapters, 'Heading 2' for sections, 'Heading 3' for subsections, ...
        if paragraph.style.name.startswith('Heading'):
            # Save heading levels and heading names
            heading_numbers.append(paragraph.style.name.split()[1])
            heading_name.append(paragraph.text)
    #print(heading_numbers)
    # Map heading levels to section numbers (assumes the first heading in the document is 'Heading 1')
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        if heading_numbers[i] == '1':
            heading_sections_result.append(str(chapter))
            chapter += 1
        else:
            if int(heading_numbers[i]) > int(heading_numbers[i-1]):
                # Deeper level: append one '.1' per level descended
                depth = int(heading_numbers[i]) - int(heading_numbers[i-1])
                heading_sections_result.append(heading_sections_result[-1] + '.1' * depth)
            elif int(heading_numbers[i]) == int(heading_numbers[i-1]):
                # Same level: increment the last component
                u = heading_sections_result[-1].split('.')
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))
            elif int(heading_numbers[i]) < int(heading_numbers[i-1]):
                # Shallower level: truncate to that level, then increment
                u = heading_sections_result[-1].split('.')
                u = u[:int(heading_numbers[i])]
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))
    # Combine heading_name and heading_sections_result
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])
    return total_heading
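The numbering step of the function above can also be written with one counter per heading level, which avoids the string splitting. This is an alternative sketch rather than the code from the linked repository; `section_numbers` is a hypothetical name, and it takes the heading levels as plain integers so it can be tried without a docx file.

```python
def section_numbers(levels):
    """Turn heading levels such as [1, 2, 2, 3, 1] into ['1', '1.1', '1.2', '1.2.1', '2']."""
    counters = []  # counters[k] is the current number at heading level k + 1
    numbers = []
    for level in levels:
        if level > len(counters):
            # Going deeper: every newly opened level starts at 1
            counters.extend([1] * (level - len(counters)))
        else:
            # Same or shallower level: drop deeper counters, then advance this one
            del counters[level:]
            counters[-1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


print(section_numbers([1, 2, 2, 3, 1, 2]))
```

With python-docx the levels could be collected as `int(p.style.name.split()[1])` for each paragraph whose style name starts with 'Heading ', as in the function above.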