Getting the list numbers of list items in a docx file using python-docx

Question (votes: 0, answers: 4)

When I access the paragraph text, it does not include the list numbering.

Current code:

from docx import Document

document = Document("C:/Foo.docx")
for p in document.paragraphs:
    print(p.text)

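List in the docx file:

What I expect:
(1) The naturalization of both...
(2) The naturalization...
(3) The naturalization...

What I get:
The naturalization of both...
The naturalization...
The naturalization...

After inspecting the document's XML, I can see that the list numbers are stored in w:abstractNum, but I don't know how to access them or tie them to the correct list items. How can I access the number of each list item in python-docx so that I can include it in my output? Is there a way to determine the correct nesting of these lists using python-docx?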

python python-3.x ms-word python-docx
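4 Answers
7 votes

According to [ReadTheDocs.python-docx]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet. The alternative (one of them, at least), [PyPI]: docx2python, handles these elements poorly (mainly because it returns everything converted to strings).

So, one solution is to parse the XML files manually; I worked out how empirically, using this example. A good place for documentation is Office Open XML (I don't know whether it is a standard followed by all tools that deal with .docx files, MS Word in particular):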

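  • Get each paragraph (w:p node) from word/document.xml
    • Check whether it is a numbered item (it has a w:pPr -> w:numPr subnode)
    • Get the numbering style Id and level: the w:val attributes of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
    • Match the 2 values (in word/numbering.xml) with:
      • the w:abstractNumId attribute of the w:abstractNum node (strictly, w:numId identifies a w:num node, whose w:abstractNumId subnode points to the w:abstractNum node)
      • the w:ilvl attribute of the w:lvl subnode
      and get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the value of the aforementioned w:numFmt attribute)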
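The lookup described in the steps above can be sketched with the standard library alone. This is only a minimal sketch, not the approach used in the rest of this answer: the function names (`w_attr`, `load_levels`, `paragraph_numbering`) are made up for illustration, and it assumes the numbering is declared directly on the paragraph (numbering inherited from a paragraph style is not covered).

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def w_attr(node, name):
    """Read a w:-namespaced attribute; None if the node or attribute is missing."""
    return None if node is None else node.get(f"{{{W}}}{name}")


def load_levels(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText) from the content of word/numbering.xml."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for abstract_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abstract_num.findall("w:lvl", NS):
            levels[int(w_attr(lvl, "ilvl"))] = (
                w_attr(lvl.find("w:numFmt", NS), "val"),
                w_attr(lvl.find("w:lvlText", NS), "val"),
            )
        abstract[w_attr(abstract_num, "abstractNumId")] = levels
    result = {}
    # Each w:num (keyed by numId) points to one w:abstractNum definition.
    for num in root.findall("w:num", NS):
        abstract_id = w_attr(num.find("w:abstractNumId", NS), "val")
        for ilvl, info in abstract.get(abstract_id, {}).items():
            result[(w_attr(num, "numId"), ilvl)] = info
    return result


def paragraph_numbering(document_xml):
    """Yield (numId, ilvl, text) per w:p; numId and ilvl are None if not numbered."""
    root = ET.fromstring(document_xml)
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_pr = p.find("w:pPr/w:numPr", NS)
        num_id = None
        if num_pr is not None:
            num_id = w_attr(num_pr.find("w:numId", NS), "val")
        if num_id is None:
            yield None, None, text
            continue
        # A missing w:ilvl is treated as level 0 here (an assumption of this sketch).
        ilvl = w_attr(num_pr.find("w:ilvl", NS), "val")
        yield num_id, int(ilvl or 0), text
```

For a real file, the two XML parts can be read with `zipfile.ZipFile(path).read("word/numbering.xml")` and `.read("word/document.xml")`; the returned bytes can be passed to these functions as they are.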

However, that seems tremendously complex, so I am proposing a workaround (gainarie) that relies on docx2python's partial support.

Test document (sample.docx, created with LibreOffice):

code00.py

#!/usr/bin/env python

import sys
import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]
    level = int(desc.attrib[ns_tag_name(desc, "val")])
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return " " * level + elem + " "


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    # (("pPr",), ("numPr",), ("ilvl", "numId"))
    # numId is for matching elements from word/numbering.xml
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)

Output

[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32

Doc title
doc subtitle
heading1 text0
Paragr0 line0
Paragr0 line1
Paragr0 line2 space Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
a) w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)
1) paragrx 1x (numbered)
a) paragrx 1ax (numbered)
I) paragrx 1aIx (numbered)
b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)
-- paragrx bullet 0
-- paragrx bullet 00
paragxx text

Done.

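Notes

  • Only nodes from word/document.xml are processed (via the paragraph's _element (LXML node) attribute)
  • Some list attributes are not captured (due to docx2python's limitations)
  • This is far from robust
  • descendants and simplified_descendants could be simplified a lot, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)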
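If the list labels are to be computed without docx2python, the counting has to be done by hand. The sketch below is one hypothetical way to do it, given a mapping from (numId, ilvl) to (numFmt, lvlText) such as the one stored in word/numbering.xml. All names here are invented for illustration, and it makes loud simplifying assumptions: every list starts at 1 (w:start and level overrides are ignored), only a few w:numFmt values are handled, and anything unknown falls back to decimal.

```python
MAX_LEVELS = 9  # WordprocessingML lists have levels 0..8


def to_roman(n):
    """Upper-case Roman numeral for a positive integer."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def to_letters(n):
    """a..z, then aa, bb, ... (assumed continuation after z)."""
    return chr(ord("a") + (n - 1) % 26) * ((n - 1) // 26 + 1)


def format_counter(n, num_fmt):
    """Render one counter according to a w:numFmt value."""
    if num_fmt == "lowerLetter":
        return to_letters(n)
    if num_fmt == "upperLetter":
        return to_letters(n).upper()
    if num_fmt == "upperRoman":
        return to_roman(n)
    if num_fmt == "lowerRoman":
        return to_roman(n).lower()
    return str(n)  # "decimal" and every format this sketch does not handle


def list_labels(items, levels):
    """items: (numId, ilvl) per paragraph, (None, None) when not numbered.
    levels: {(numId, ilvl): (numFmt, lvlText)}. Returns one label per paragraph."""
    counters = {}
    labels = []
    for num_id, ilvl in items:
        if num_id is None or (num_id, ilvl) not in levels:
            labels.append("")
            continue
        counts = counters.setdefault(num_id, [0] * MAX_LEVELS)
        counts[ilvl] += 1
        # A new item at this level restarts all deeper levels.
        for deeper in range(ilvl + 1, MAX_LEVELS):
            counts[deeper] = 0
        num_fmt, lvl_text = levels[(num_id, ilvl)]
        label = lvl_text or ""
        if num_fmt != "bullet":
            # %1..%9 in lvlText stand for the counters of levels 0..8.
            for k in range(MAX_LEVELS):
                placeholder = f"%{k + 1}"
                if placeholder in label:
                    fmt_k = levels.get((num_id, k), (num_fmt, None))[0]
                    label = label.replace(placeholder, format_counter(counts[k], fmt_k))
        labels.append(label)
    return labels
```

Each returned label can then be prepended to the matching paragraph text, indented by its level if the nesting should be visible.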

3 votes

This worked for me, using the docx2python module:

from docx2python import docx2python

document = docx2python("C:/input/MyDoc.docx")
print(document.body)
    

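0 votes

There is another route: convert the numbering to text first. After that you can use python-docx as usual, without having to handle the numbers yourself.

Open the document in Word, open the Visual Basic editor (Alt+F11), open the Immediate window (Ctrl+G), type the following macro and press Enter:

ActiveDocument.Range.ListFormat.ConvertNumbersToText

At this point you can save the document and run it through python-docx.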

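0 votes

python-docx does not seem to have a feature for extracting heading numbers, so I wrote one; see my GitHub:

https://github.com/nguyendangson/extract_number_heading_python-docx

Input: the path to your docx file.

Output: a list of headings, each with its number and name.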

from docx import Document


def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names and their number headings corresponding chapters, sections, subsections,...
    '''
    doc = Document(doc_path)

    # Extract chapters, section, subsection numbers
    heading_numbers = []
    heading_name = []
    for paragraph in doc.paragraphs:
        # Check if the paragraph is a heading:
        # Heading 1 for chapters, Heading 2 for sections, Heading 3 for subsections, ...
        if paragraph.style.name.startswith('Heading'):
            # Save heading numbers and heading names
            heading_numbers.append(paragraph.style.name.split()[1])
            heading_name.append(paragraph.text)
    #print(heading_numbers)

    # Map heading_numbers to heading sections
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        current = int(heading_numbers[i])
        if current == 1:
            heading_sections_result.append(str(chapter))
            chapter += 1
            continue
        previous = int(heading_numbers[i - 1])
        if current > previous:
            # Going deeper: add one '.1' per level descended, so that a jump
            # from Heading 1 to Heading 3 yields x.1.1 rather than x.1
            heading_sections_result.append(heading_sections_result[-1] + '.1' * (current - previous))
        else:
            # Same level or shallower: trim to the current depth, then increment
            u = heading_sections_result[-1].split('.')[:current]
            u[-1] = str(int(u[-1]) + 1)
            heading_sections_result.append('.'.join(u))

    # Combine heading_name and heading_sections_result
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])

    return total_heading
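The numbering step on its own can also be written with a single list of counters. The following is a compact sketch of that idea, not the code from the repository above; `number_headings` is a made-up name, it takes the heading levels (1 for "Heading 1", 2 for "Heading 2", and so on) as plain integers, and it assumes that a skipped level is numbered 1.

```python
def number_headings(levels):
    """Turn heading levels such as [1, 2, 2, 3] into labels such as
    ['1', '1.1', '1.2', '1.2.1']."""
    counters = []  # counters[k] is the current number at depth k + 1
    labels = []
    for level in levels:
        if level <= len(counters):
            # Same depth or shallower: drop deeper counters, bump this one.
            del counters[level:]
            counters[-1] += 1
        else:
            # Deeper: every newly opened level starts at 1 (assumption).
            counters.extend([1] * (level - len(counters)))
        labels.append(".".join(str(c) for c in counters))
    return labels
```

With python-docx, the levels could be collected from the style names in the same way as in the function above, for example `int(paragraph.style.name.split()[1])` for each paragraph whose style name starts with 'Heading'.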
    