python-docx: extract text together with heading and sub-heading numbers

Question (votes: 0, answers: 2)

I have a Word document with the following structure:

1. Heading
    1.1. Sub-heading
        (a) Sub-sub-heading

When I load the document with docx using this code:

import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)
print(getText("a.docx"))

I get the following output:

Heading
Sub-heading
Sub-sub-heading

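How can I extract the heading/sub-heading numbers along with the text? I tried simplify_docx, but it only works with the standard MS Word heading styles, not with custom heading styles.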

python docx python-docx
2 Answers
4 votes

Unfortunately, the numbers are not part of the text. Word itself generates them from the heading style (Heading i), and I don't think docx exposes any way to get this number.

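However, you can use para.style to retrieve the style/level and then read through the document to recompute the numbering scheme. This is cumbersome, though, because it does not take into account any custom styles you may be using. There may be a way to access the numbering scheme in the document's style.xml part, but I don't know how.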

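import docx

# Map the built-in heading style names ("Heading 1" ... "Heading 9") to their level
level_from_style_name = {f'Heading {i}': i for i in range(1, 10)}

def format_levels(cur_lev):
    # Drop unused (zero) counters and join the rest, e.g. [0, 1, 2, 0, ...] -> "1.2"
    levs = [str(l) for l in cur_lev if l != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')

current_levels = [0] * 10  # one counter per heading level (index 0 is unused)
full_text = []

for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        # Not a heading: keep the text unchanged
        full_text.append(p.text)
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        # A new heading resets the counters of all deeper levels
        for l in range(level + 1, 10):
            current_levels[l] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for line in full_text:
    print(line)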

Running this on a sample document gives me:

Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You got the drill…
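
The same counting idea can be stretched to cover custom styles and the "1. / 1.1. / (a)" layout from the question. The following is only a sketch under stated assumptions: the style names ("MyHeading1" and so on), the helper names, and the rule that every level from 3 downward is lettered are all made up for illustration, so adapt them to your document. It works on plain (style_name, text) pairs, which keeps it independent of python-docx.

```python
def to_letter(n):
    """Turn 1 -> 'a', 2 -> 'b', ... (sketch: only handles 1..26)."""
    return chr(ord('a') + n - 1)


def number_paragraphs(paragraphs, level_by_style):
    """Prefix heading paragraphs with a recomputed number.

    paragraphs     -- iterable of (style_name, text) pairs
    level_by_style -- hypothetical mapping of your style names to levels 1, 2, 3, ...
    """
    counters = [0] * (max(level_by_style.values()) + 1)  # index 0 is unused
    out = []
    for style_name, text in paragraphs:
        level = level_by_style.get(style_name)
        if level is None:
            out.append(text)  # not a heading: keep the text unchanged
            continue
        counters[level] += 1
        # A new heading resets the counters of all deeper levels
        for deeper in range(level + 1, len(counters)):
            counters[deeper] = 0
        if level == 1:
            label = f"{counters[1]}."
        elif level == 2:
            label = f"{counters[1]}.{counters[2]}."
        else:
            # Assumption: level 3 and deeper are lettered, as in the question
            label = f"({to_letter(counters[level])})"
        out.append(f"{label} {text}")
    return out
```

With python-docx you would feed it something like ((p.style.name, p.text) for p in doc.paragraphs) together with your own style-to-level mapping.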

0 votes

python-docx does not seem to have a feature for extracting heading numbers, so I wrote one; see my GitHub: https://github.com/nguyendangson/extract_number_heading_python-docx

Input: the path to your docx file.

Output: a list of heading numbers and heading names.

from docx import Document

def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names prefixed with their section numbers
    (chapters, sections, subsections, ...)
    '''
    doc = Document(doc_path)
    # Collect the level and text of every heading
    heading_numbers = []
    heading_name = []

    for paragraph in doc.paragraphs:
        # "Heading 1" marks a chapter, "Heading 2" a section, "Heading 3" a subsection, ...
        parts = paragraph.style.name.split()
        if len(parts) == 2 and parts[0] == 'Heading' and parts[1].isdigit():
            heading_numbers.append(int(parts[1]))
            heading_name.append(paragraph.text)

    # Map heading levels to section numbers such as "1", "1.1", "1.1.1".
    # Assumes the first heading in the document is a "Heading 1".
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        level = heading_numbers[i]
        if level == 1:
            heading_sections_result.append(str(chapter))
            chapter += 1
        elif level > heading_numbers[i - 1]:
            # Going deeper: add one ".1" per level descended
            heading_sections_result.append(
                heading_sections_result[-1] + '.1' * (level - heading_numbers[i - 1]))
        else:
            # Same or shallower level: cut back to this level, then increment its last counter
            u = heading_sections_result[-1].split('.')[:level]
            u[-1] = str(int(u[-1]) + 1)
            heading_sections_result.append('.'.join(u))

    # Combine section numbers and heading names
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])

    return total_heading
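
As a side note on where Word keeps the definitions it uses to generate these numbers: a .docx file is a zip archive, and its list numbering definitions are stored in the word/numbering.xml part as w:abstractNum elements with one w:lvl per level. Here is a minimal sketch, using only the standard library, that lists the number format and label pattern of each level. The function name read_level_formats is made up for this example, and the sketch does not resolve which definition a given paragraph or style actually uses.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_level_formats(docx_file):
    """Return {abstractNumId: {ilvl: (numFmt, lvlText)}} from word/numbering.xml.

    docx_file may be a path or a binary file-like object.
    Returns {} when the document has no numbering part.
    """
    with zipfile.ZipFile(docx_file) as zf:
        if "word/numbering.xml" not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read("word/numbering.xml"))

    formats = {}
    for abstract in root.findall("w:abstractNum", NS):
        abstract_id = abstract.get(f"{{{W}}}abstractNumId")
        levels = {}
        for lvl in abstract.findall("w:lvl", NS):
            ilvl = int(lvl.get(f"{{{W}}}ilvl"))
            num_fmt = lvl.find("w:numFmt", NS)    # e.g. "decimal", "lowerLetter"
            lvl_text = lvl.find("w:lvlText", NS)  # e.g. "%1.%2." or "(%3)"
            levels[ilvl] = (
                num_fmt.get(f"{{{W}}}val") if num_fmt is not None else None,
                lvl_text.get(f"{{{W}}}val") if lvl_text is not None else None,
            )
        formats[abstract_id] = levels
    return formats
```

Calling read_level_formats("a.docx") on the question's file would show, for each level, whether Word renders it as a decimal or a letter and which pattern (such as "%1.%2.") it fills in. The counters themselves still have to be recomputed by walking the paragraphs.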