Getting the list numbers of list items in a docx file with python-docx

Question

When I access the paragraph text, it does not include the list numbering.

Current code:

from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)

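The list in the docx file:

What I expected:
(1) Naturalization of both...
(2) Naturalization...
(3) Naturalization...

What I got:
Naturalization of both...
Naturalization...
Naturalization...

After inspecting the document's XML, I found that the list numbers are stored in w:abstractNum, but I don't know how to access them or how to connect them to the right list items. How can I access the number of each list item in python-docx so that I can include it in my output? Is there a way to determine the correct nesting of these lists with python-docx?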

Answer 1

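According to [ReadTheDocs.Python-DocX]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet. The alternative (at least one of them), [PyPI]: docx2python, handles these elements poorly (mainly because it returns everything converted to strings).

So, a solution would be to parse the XML files manually; I discovered how to do it empirically while working on this example. A good place for documentation is Office Open XML (I don't know whether it is a standard followed by all tools that deal with .docx files, especially MS Word):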

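Get each paragraph (w:p node) from word/document.xml
- Check whether it is a numbered item (it has a w:pPr -> w:numPr subnode)
- Get the numbering style Id and level: the w:val attribute of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
- Match the 2 values (in word/numbering.xml) with:
  - the w:abstractNumId attribute of the w:abstractNum node
  - the w:ilvl attribute of the w:lvl subnode
  and get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the bullet value of the aforementioned w:numFmt attribute)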
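
The steps above can be sketched with the standard library alone. This is a minimal sketch, not part of the original answer: the helper names (w_attr, load_levels, list_items, read_docx_parts) are made up for illustration. One assumption differs slightly from the list above: in the Office Open XML schema the w:numId of a paragraph refers to a w:num node in word/numbering.xml, and that node points to the w:abstractNum through its w:abstractNumId subnode, so the sketch follows that indirection. Level overrides (w:lvlOverride) and numbering inherited from paragraph styles are ignored.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(node, name):
    """Read a w:-namespaced attribute, or None if the node or attribute is missing."""
    return None if node is None else node.get("{%s}%s" % (W, name))


def load_levels(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText) from the content of word/numbering.xml."""
    root = ET.fromstring(numbering_xml)
    abstract = {}  # abstractNumId -> {ilvl: (numFmt, lvlText)}
    for abstract_num in root.findall("w:abstractNum", NS):
        per_level = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            per_level[w_attr(lvl, "ilvl")] = (
                w_attr(lvl.find("w:numFmt", NS), "val"),
                w_attr(lvl.find("w:lvlText", NS), "val"),
            )
        abstract[w_attr(abstract_num, "abstractNumId")] = per_level
    levels = {}
    for num in root.findall("w:num", NS):
        # w:num -> w:abstractNumId/@w:val selects the abstract definition
        abstract_id = w_attr(num.find("w:abstractNumId", NS), "val")
        for ilvl, info in abstract.get(abstract_id, {}).items():
            levels[(w_attr(num, "numId"), ilvl)] = info
    return levels


def list_items(document_xml, numbering_xml):
    """Return (level, numFmt, lvlText, text) per paragraph; level is None for plain text."""
    levels = load_levels(numbering_xml)
    items = []
    for par in ET.fromstring(document_xml).iter("{%s}p" % W):
        text = "".join(t.text or "" for t in par.iter("{%s}t" % W))
        num_pr = par.find("w:pPr/w:numPr", NS)
        num_id = w_attr(num_pr.find("w:numId", NS), "val") if num_pr is not None else None
        if num_id is None:
            items.append((None, None, None, text))
            continue
        # Assumption: a missing w:ilvl is treated as level 0
        ilvl = w_attr(num_pr.find("w:ilvl", NS), "val") or "0"
        num_fmt, lvl_text = levels.get((num_id, ilvl), (None, None))
        items.append((int(ilvl), num_fmt, lvl_text, text))
    return items


def read_docx_parts(path):
    """Read the two XML parts from a .docx (fails if the file has no numbering part)."""
    with zipfile.ZipFile(path) as zf:
        return zf.read("word/document.xml"), zf.read("word/numbering.xml")
```

With a real file this would be called as list_items(*read_docx_parts("sample.docx")), where the path is a placeholder. The result only tells you the format (for example decimal with the template "(%1)") and the nesting level; turning that into "(1)", "(2)", ... still requires keeping a counter per list and level.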

However, that seems extremely complex, so I propose a workaround (gainarie) that makes use of docx2python's partial support.

Test document (sample.docx, created with LibreOffice):

code00.py：

#!/usr/bin/env python

import sys
import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]
    level = int(desc.attrib[ns_tag_name(desc, "val")])
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return "    " * level + elem + " "


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output:

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32

Doc title
doc subtitle
heading1 text0
Paragr0 line0
Paragr0 line1
Paragr0 line2 space
Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
a) w tabs
Paragrx line2 (NOT numbered – just to mimic 1ax below)
1) paragrx 1x (numbered)
    a) paragrx 1ax (numbered)
        I) paragrx 1aIx (numbered)
    b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)
-- paragrx bullet 0
    -- paragrx bullet 00
paragxx text

Done.

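Notes:

- Only nodes from word/document.xml are processed (via the paragraph's _element (LXML node) attribute)
- ... (a limitation of docx2python)
- descendants and simplified_descendants could be simplified a lot, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)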

Answer 2

This worked for me, using the docx2python module:

from docx2python import docx2python
document = docx2python("C:/input/MyDoc.docx")
print(document.body)
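
The body printed above is a deeply nested list rather than flat text, so a small helper makes it easier to read. The following is a minimal sketch, not part of the original answer: iter_paragraphs and indent_level are made-up names, sample_body is a hand-written stand-in for document.body, and the assumption that nested list items start with one tab per level is inferred from the split("\t") handling in Answer 1 and may differ between docx2python versions.

```python
def iter_paragraphs(nested):
    """Yield every string found in an arbitrarily nested list, in document order."""
    for item in nested:
        if isinstance(item, str):
            yield item
        else:
            yield from iter_paragraphs(item)


def indent_level(paragraph):
    """Count leading tabs (assumed to mark the nesting depth of a list item)."""
    return len(paragraph) - len(paragraph.lstrip("\t"))


# Hand-written stand-in for `docx2python("C:/input/MyDoc.docx").body`
sample_body = [[[["Intro", "1)\tfirst", "\ta)\tnested", "2)\tsecond"]]]]

for par in iter_paragraphs(sample_body):
    # Re-indent with four spaces per level and replace the separating tab
    print("    " * indent_level(par) + par.strip("\t").replace("\t", " "))
```

Because the numbering arrives already rendered as text ("1)", "a)", ...), nothing needs to be looked up in word/numbering.xml with this approach.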

Answer 3

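There is another path: convert the numbering to text first. After that you can use python-docx as usual, without having to handle the numbers yourself.

Open the document in Word, open the Visual Basic editor (Alt+F11), open the Immediate window (Ctrl+G), type the following macro and press Enter:

ActiveDocument.Range.ListFormat.ConvertNumbersToText

At this point, you can save the document and run it through python-docx.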

Answer 4

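python-docx does not seem to have a feature for extracting heading numbers, so I wrote one; see my GitHub:

https://github.com/nguyendangson/extract_number_heading_python-docx

Input: the path to your docx file

Output: a list of heading numbers and heading names.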

from docx import Document


def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names prefixed with their heading numbers (chapters, sections, subsections, ...)
    '''
    doc = Document(doc_path)
    # Collect the level and the name of every heading
    heading_numbers = []
    heading_name = []

    for paragraph in doc.paragraphs:
        # Check if the paragraph is a heading:
        # 'Heading 1' for chapters, 'Heading 2' for sections, 'Heading 3' for subsections, ...
        if paragraph.style.name.startswith('Heading'):
            # Save heading levels and heading names
            heading_numbers.append(paragraph.style.name.split()[1])
            heading_name.append(paragraph.text)
    # print(heading_numbers)

    # Map heading levels to section numbers (assumes the first heading is a 'Heading 1')
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        if heading_numbers[i] == '1':
            heading_sections_result.append(str(chapter))
            chapter += 1
        else:
            level = int(heading_numbers[i])
            previous_level = int(heading_numbers[i - 1])
            if level > previous_level:
                # Going deeper: add one '.1' for every level descended
                heading_sections_result.append(heading_sections_result[-1] + '.1' * (level - previous_level))
            elif level == previous_level:
                # Same level: increment the last component
                u = heading_sections_result[-1].split('.')
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))
            else:
                # Going back up: truncate to the new depth, then increment
                u = heading_sections_result[-1].split('.')
                u = u[:level]
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))

    # Combine heading_sections_result and heading_name
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])

    return total_heading
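
The numbering step of the function above can also be written with one counter per heading level, which makes it testable without a docx file. This is a minimal sketch of that idea, not the code from the linked repository; number_headings is a made-up name, and a skipped level (for example a Heading 3 directly under a Heading 1) is displayed as 1 by assumption.

```python
def number_headings(levels):
    """Turn heading levels (1 = chapter, 2 = section, ...) into dotted numbers."""
    counters = []  # counters[k] is the current number at level k + 1
    numbers = []
    for level in levels:
        if level > len(counters):
            # Going deeper: open the missing levels with a zero counter
            counters.extend([0] * (level - len(counters)))
        else:
            # Same level or going back up: drop the counters of deeper levels
            del counters[level:]
        counters[level - 1] += 1
        # Assumption: a skipped level (counter still 0) is displayed as 1
        numbers.append(".".join(str(max(c, 1)) for c in counters))
    return numbers


print(number_headings([1, 2, 2, 3, 1]))
# ['1', '1.1', '1.2', '1.2.1', '2']
```

The levels would come from the style names, for example int(paragraph.style.name.split()[1]) for every paragraph whose style name starts with 'Heading'.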
