How can I extract heading numbers from a document using python-docx?

Question

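I am using the python-docx library to extract data from docx documents, but I also want the heading/paragraph numbers. I want to build a proofreading tool and I need that information, but I can find it neither in the text nor in the paragraph's style. Is there any way to extract it? I could loop over the tags that share a heading number, but what if the user did not use the correct heading tags when writing the document? Or what if they chose not to use the default Word convention

1, 1.1, 1.1.1, a

and used a convention of their own instead?

Basically, I want a way to extract numbers such as

2, 2.1, 2.2.1, (a)

How should I go about it?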

Answer 1

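I tried something similar, but for multilingual documents.

First, look at the headings (1, 2, 3 ..) and subheadings (2.1, 2.2 ..) and try to find what they have in common. They may share some distinguishing patterns, such as:

Bold text
Font and font size
Headings start with an int (2), subheadings start with a float (2.1)
The delimiter that comes after the number and before the text (a tab or a space)

Observe these properties and build a pattern from them. With regular expressions we can then extract what we need.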
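As a minimal sketch of this idea, the snippet below splits a paragraph's text into its leading number and its title. It assumes the number is typed into the text itself and follows one of the formats from the question (`2`, `2.1`, `2.2.1`, `(a)`), followed by a tab or spaces. The names `NUMBER_PREFIX` and `split_heading_number` are made up for illustration, and the standard `re` module is enough here only because the sketch matches ASCII digits and letters.

```python
import re

# Assumed formats: "2", "2.", "2.1", "2.2.1", "(a)", "a)",
# followed by a tab or spaces and then the heading text.
NUMBER_PREFIX = re.compile(
    r"""^\s*(?P<number>
            \d+(?:\.\d+)*\.?      # 2, 2., 2.1, 2.2.1
          | \(?[A-Za-z]\)         # (a) or a)
        )
        [ \t]+                    # delimiter between number and text
        (?P<title>\S.*)$          # the heading text itself
    """,
    re.VERBOSE,
)


def split_heading_number(text):
    """Return (number, title) if text starts with a number token, else None."""
    match = NUMBER_PREFIX.match(text)
    if match is None:
        return None
    return match.group("number"), match.group("title")
```

A sentence such as "2024 was a good year" also matches, which is why the formatting hints from the list above (bold, font, size) are still worth checking before treating a match as a heading.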

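Here are regular expressions that fit your case, even for multilingual text.

headings = regex.search(r"\d+\.\t(\p{Lu}+([\s]+)?)+", text)
subHeadings = regex.search(r"\d+\.\d+\t\p{Lu}(\p{Ll}+)+", text)

Python's built-in regular expression module ( re ) does not support Unicode property classes such as \p{Lu}. So use the third-party [regex][1] module instead, especially if your text is multilingual.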

import regex
from docx import Document

doc = Document("<<Your doc file name here>>")

# Iterate through paragraphs (in Word, everything is a paragraph;
# even the blank lines are paragraphs)
for index, para in enumerate(doc.paragraphs):
    # Skip the blank paragraphs
    if para.text:
        headings = regex.search(r"\d+\.\t(\p{Lu}+([\s]+)?)+", para.text, regex.UNICODE)
        subHeadings = regex.search(r"\d+\.\d+\t\p{Lu}(\p{Ll}+)+", para.text, regex.UNICODE)
        if headings:
            for run in para.runs:
                # Check for bold or italic at the run level
                if run.bold:
                    print("Bold Heading :", headings.group(0))
                if run.italic:
                    print("Italic Heading :", headings.group(0))
        if subHeadings:
            for run in para.runs:
                # Check for bold or italic at the run level
                if run.bold:
                    print("Bold subHeadings :", subHeadings.group(0))
                if run.italic:
                    print("Italic subHeadings :", subHeadings.group(0))

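Note: bold or italic is not always set at the run level. If you do not get these attributes from the run, check the style and the paragraph level as well.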

Answer 2

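python-docx does not seem to have a feature for extracting heading numbers, so I wrote one. See my GitHub: https://github.com/nguyendangson/extract_number_heading_python-docx

Input: the path to your docx file

Output: a list of heading numbers and heading names.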

from docx import Document


def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names prefixed with their section numbers (chapters, sections, subsections, ...)
    '''
    doc = Document(doc_path)
    # Collect the level and the text of every heading paragraph
    heading_numbers = []
    heading_name = []

    for paragraph in doc.paragraphs:
        # Built-in heading styles are named 'Heading 1' (chapters), 'Heading 2' (sections), 'Heading 3' (subsections), ...
        parts = paragraph.style.name.split()
        if len(parts) == 2 and parts[0] == 'Heading' and parts[1].isdigit():
            heading_numbers.append(int(parts[1]))
            heading_name.append(paragraph.text)

    # Map heading levels to section numbers.
    # Assumes the first heading in the document is a 'Heading 1'.
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        if heading_numbers[i] == 1:
            heading_sections_result.append(str(chapter))
            chapter += 1
        elif heading_numbers[i] > heading_numbers[i-1]:
            # Deeper level: append one '.1' per level descended
            heading_sections_result.append(heading_sections_result[-1] + '.1' * (heading_numbers[i] - heading_numbers[i-1]))
        elif heading_numbers[i] == heading_numbers[i-1]:
            # Same level: increment the last component
            u = heading_sections_result[-1].split('.')
            u[-1] = str(int(u[-1]) + 1)
            heading_sections_result.append('.'.join(u))
        else:
            # Shallower level: truncate to this level, then increment
            u = heading_sections_result[-1].split('.')
            u = u[:heading_numbers[i]]
            u[-1] = str(int(u[-1]) + 1)
            heading_sections_result.append('.'.join(u))

    # Combine heading_sections_result and heading_name
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])

    return total_heading
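
The level-to-number mapping can also be sketched with one counter per level, independent of python-docx. This is not the code from the linked repository, only an alternative illustration of the same step. The name `number_headings` is made up, and levels are assumed to be integers starting at 1.

```python
def number_headings(levels):
    """Turn heading levels such as [1, 2, 2, 3, 1] into section numbers."""
    counters = []  # counters[k] holds the current number at level k + 1
    numbers = []
    for level in levels:
        if level > len(counters):
            # Going deeper: start every newly opened level at 1
            counters.extend([1] * (level - len(counters)))
        else:
            # Same or shallower level: drop deeper counters, bump this one
            del counters[level:]
            counters[level - 1] += 1
        numbers.append(".".join(str(n) for n in counters))
    return numbers


print(number_headings([1, 2, 2, 3, 1]))  # ['1', '1.1', '1.2', '1.2.1', '2']
```

Keeping the counters as integers avoids splitting and re-joining the previous number string, and a skipped level (for example a 'Heading 3' directly under a 'Heading 1') simply opens the missing levels at 1.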
