python-docx: Extract text along with heading and sub-heading numbers

Question

I have a Word document structured as follows:

1. Heading
    1.1. Sub-heading
        (a) Sub-sub-heading

When I load the document with docx using this code:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
print(getText("a.docx"))

I get the following output:

Heading
Sub-heading
Sub-sub-heading

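How can I extract the heading/sub-heading numbers together with the text? I tried simplify_docx, but it only works with the standard MS Word heading styles, not with custom heading styles.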

Answer 1

Unfortunately, the numbers are not part of the text. Word generates them itself from the heading style (Heading i), and I don't think docx exposes any way to get them.

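However, you can retrieve the style/level with para.style and then walk through the document to recompute the numbering scheme. This is cumbersome, though, because it does not account for any custom styles you may use. There may be a way to access the numbering scheme in the document's styles.xml part, but I don't know how.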

import docx

level_from_style_name = {f'Heading {i}': i for i in range(10)}

def format_levels(cur_lev):
    levs = [str(l) for l in cur_lev if l != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')

current_levels = [0] * 10
full_text = []

for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        full_text.append(p.text)
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        for l in range(level + 1, 10):
            current_levels[l] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for l in full_text:
    print(l)

On a sample document, this gives me:

Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You got the drill…
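
On the open question of reading the numbering from the file itself: a .docx is a zip archive, and in WordprocessingML a numbered paragraph can carry a w:numPr element holding a zero-based level (w:ilvl) and a numbering id (w:numId). The sketch below only lists those references using the standard library. It is a starting point under assumptions, not a full solution: the helper names are hypothetical, the actual number format (decimal, letters, ...) is defined in word/numbering.xml, and when numbering is attached to a style the w:numPr sits in word/styles.xml, so the paragraph itself reports None.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def numbering_refs(document_xml):
    """Yield (style_id, ilvl, num_id, text) for each w:p element in document.xml."""
    root = ET.fromstring(document_xml)
    for p in root.iter(f"{{{W}}}p"):
        style = p.find("w:pPr/w:pStyle", NS)
        ilvl = p.find("w:pPr/w:numPr/w:ilvl", NS)
        num_id = p.find("w:pPr/w:numPr/w:numId", NS)
        # Paragraph text is spread over one or more w:t elements
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        yield (
            style.get(f"{{{W}}}val") if style is not None else None,
            int(ilvl.get(f"{{{W}}}val")) if ilvl is not None else None,
            num_id.get(f"{{{W}}}val") if num_id is not None else None,
            text,
        )


def numbering_refs_from_docx(path):
    """Hypothetical convenience wrapper: read word/document.xml out of the zip."""
    with zipfile.ZipFile(path) as z:
        return list(numbering_refs(z.read("word/document.xml")))
```

Calling numbering_refs_from_docx("a.docx") shows which paragraphs reference a numbering definition directly. Turning a (numId, ilvl) pair into the displayed number still requires resolving it against word/numbering.xml and counting, as in the loop above.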

Answer 2

python-docx doesn't seem to have a feature for extracting heading numbers, so I wrote one; see my GitHub: https://github.com/nguyendangson/extract_number_heading_python-docx

Input: the path to your docx file.

Output: a list of heading numbers and names.

from docx import Document


def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names prefixed with their numbers
            (chapters, sections, subsections, ...)
    '''
    doc = Document(doc_path)
    # Extract heading levels and heading names
    heading_numbers = []
    heading_name = []

    for paragraph in doc.paragraphs:
        # Check if the paragraph is a heading: 'Heading 1' for chapters,
        # 'Heading 2' for sections, 'Heading 3' for subsections, ...
        parts = paragraph.style.name.split()
        if len(parts) == 2 and parts[0] == 'Heading' and parts[1].isdigit():
            # Save heading level and heading name
            heading_numbers.append(int(parts[1]))
            heading_name.append(paragraph.text)

    # Map heading levels to section numbers
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        level = heading_numbers[i]
        if level == 1:
            heading_sections_result.append(str(chapter))
            chapter += 1
        elif i == 0:
            # Document starts below level 1: number it under a placeholder chapter 0
            heading_sections_result.append('0' + '.1' * (level - 1))
        elif level > heading_numbers[i-1]:
            # Going deeper: add one '.1' per level descended
            heading_sections_result.append(
                heading_sections_result[-1] + '.1' * (level - heading_numbers[i-1]))
        else:
            # Same level or shallower: keep the first `level` parts and bump the last
            u = heading_sections_result[-1].split('.')[:level]
            u[-1] = str(int(u[-1]) + 1)
            heading_sections_result.append('.'.join(u))

    # Combine heading_name and heading_sections_result
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])

    return total_heading
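
Both answers rely on the built-in "Heading N" style names and produce dotted numbers, whereas the question uses custom styles and a "1.", "1.1.", "(a)" scheme. A minimal sketch of the same counting idea with a custom style map and a custom format follows. The style names in LEVEL_OF_STYLE and both function names are hypothetical, and the letter format is simplified (it wraps around after z); the real scheme in a given document may differ.

```python
from string import ascii_lowercase

# Hypothetical mapping from custom style names to outline levels;
# replace the keys with the style names your document actually uses.
LEVEL_OF_STYLE = {"MyHeading": 1, "MySubHeading": 2, "MySubSubHeading": 3}


def format_number(counters):
    """Format counters as '1.', '1.1.' or '(a)' depending on depth."""
    if len(counters) >= 3:
        # Third level and deeper use letters: 1 -> (a), 2 -> (b), ...
        return f"({ascii_lowercase[(counters[-1] - 1) % 26]})"
    return ".".join(str(c) for c in counters) + "."


def number_paragraphs(paragraphs, level_of_style=LEVEL_OF_STYLE):
    """Turn (style_name, text) pairs into a list of numbered lines."""
    counters = []
    lines = []
    for style_name, text in paragraphs:
        level = level_of_style.get(style_name)
        if level is None:
            lines.append(text)  # body text is kept unchanged
            continue
        counters = counters[:level]  # forget counters of deeper levels
        if len(counters) == level:
            counters[-1] += 1  # next sibling at this level
        else:
            # Going deeper: start each newly entered level at 1
            counters += [1] * (level - len(counters))
        lines.append(f"{format_number(counters)} {text}")
    return lines
```

With python-docx, the input pairs can be built from the attributes already used above, for example ((p.style.name, p.text) for p in docx.Document("a.docx").paragraphs). Keeping the counting separate from the file access makes the numbering logic easy to check without a Word file.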
