I have a Word document with the following structure:
1. Heading
1.1. Sub-heading
(a) Sub-sub-heading
When I load the document with docx using this code:
import docx

def getText(filename):
    doc = docx.Document(filename)
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)

print(getText("a.docx"))
I get the following output:
Heading
Sub-heading
Sub-sub-heading
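How can I extract the heading/sub-heading numbers together with the text? I tried simplify_docx, but it only works with the standard MS Word heading styles, not with custom heading styles.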
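Unfortunately, the numbers are not part of the text. Word generates them itself from the heading style (Heading i), and I don't think docx exposes any way to get these numbers.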
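However, you can retrieve the style/level with para.style and then read through the document to recompute the numbering scheme. This is cumbersome, though, and it does not account for any custom styles you might use. There may be a way to access the numbering scheme in the style.xml part of the document, but I don't know how.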
import docx

# Map built-in heading style names to their outline level
level_from_style_name = {f'Heading {i}': i for i in range(10)}

def format_levels(cur_lev):
    levs = [str(l) for l in cur_lev if l != 0]
    return '.'.join(levs)  # Customize your format here

d = docx.Document('my_doc.docx')
current_levels = [0] * 10  # one counter per heading level
full_text = []
for p in d.paragraphs:
    if p.style.name not in level_from_style_name:
        # Not a heading: keep the text unchanged
        full_text.append(p.text)
    else:
        level = level_from_style_name[p.style.name]
        current_levels[level] += 1
        # A new heading restarts the counters of all deeper levels
        for l in range(level + 1, 10):
            current_levels[l] = 0
        full_text.append(format_levels(current_levels) + ' ' + p.text)

for l in full_text:
    print(l)
Running it on a sample document gives me:
Hello world
1 H1 foo
1.1 H2 bar
1.1.1 H3 baz
Paragraph are really nice !
1.1.2 H3 bibou
Something else
2 H1 foofoo
You got the drill…
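On the numbering scheme mentioned above: as far as I know, the level definitions live in the word/numbering.xml part of the .docx archive (styles only point at them through a numId), and a .docx is an ordinary zip file. Here is a minimal sketch, using only the standard library, that lists the number format and level text of each abstract numbering definition. The function name read_numbering_levels is made up for illustration, and the sketch does not work out which style or paragraph uses which definition.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" form
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_numbering_levels(docx_file):
    """Hypothetical helper: return {abstractNumId: [(ilvl, numFmt, lvlText), ...]}.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as archive:
        if "word/numbering.xml" not in archive.namelist():
            return {}  # the document defines no numbering at all
        root = ET.fromstring(archive.read("word/numbering.xml"))

    schemes = {}
    for abstract in root.findall(W + "abstractNum"):
        levels = []
        for lvl in abstract.findall(W + "lvl"):
            num_fmt = lvl.find(W + "numFmt")    # e.g. decimal, lowerLetter
            lvl_text = lvl.find(W + "lvlText")  # e.g. "%1.%2." or "(%3)"
            levels.append((
                int(lvl.get(W + "ilvl")),
                num_fmt.get(W + "val") if num_fmt is not None else None,
                lvl_text.get(W + "val") if lvl_text is not None else None,
            ))
        schemes[abstract.get(W + "abstractNumId")] = levels
    return schemes
```

A level text such as %1.%2. means "counter of level 1, dot, counter of level 2, dot", so the counters themselves still have to be recomputed by walking the paragraphs, as in the code above.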
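python-docx does not seem to have a feature for extracting heading numbers, so I wrote one. See my GitHub: https://github.com/nguyendangson/extract_number_heading_python-docx
Input: the path to your docx file.
Output: a list of heading numbers and heading names.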
from docx import Document

def extract_number_heading(doc_path: str):
    '''
    Input: path of a docx file
    Output: a list of heading names, each prefixed with its chapter, section, subsection, ... number
    '''
    doc = Document(doc_path)
    # Collect the level and the text of every heading paragraph
    heading_numbers = []
    heading_name = []
    for paragraph in doc.paragraphs:
        # 'Heading 1' is a chapter, 'Heading 2' a section, 'Heading 3' a subsection, ...
        if paragraph.style.name.startswith('Heading'):
            # Assumes style names of the form 'Heading N'
            heading_numbers.append(paragraph.style.name.split()[1])
            heading_name.append(paragraph.text)
    # Map heading levels to section numbers (assumes the first heading is a 'Heading 1')
    heading_sections_result = []
    chapter = 1
    for i in range(len(heading_numbers)):
        if heading_numbers[i] == '1':
            heading_sections_result.append(str(chapter))
            chapter += 1
        else:
            level = int(heading_numbers[i])
            previous = int(heading_numbers[i-1])
            if level > previous:
                # Going deeper: append one '.1' per level descended
                heading_sections_result.append(heading_sections_result[-1] + '.1' * (level - previous))
            elif level == previous:
                # Same level: increment the last component
                u = heading_sections_result[-1].split('.')
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))
            else:
                # Going back up: truncate to the new depth, then increment
                u = heading_sections_result[-1].split('.')
                u = u[:level]
                u[-1] = str(int(u[-1]) + 1)
                heading_sections_result.append('.'.join(u))
    # Combine the section numbers with the heading names
    total_heading = []
    for i in range(len(heading_numbers)):
        total_heading.append(heading_sections_result[i] + ' ' + heading_name[i])
    return total_heading
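Both answers hard-code the built-in "Heading N" style names and a dotted decimal format, while the question uses custom styles and a "1.", "1.1.", "(a)" scheme. Here is a rough sketch of the same counter idea with both made configurable. It works on plain (style name, text) pairs so it runs without a document. The style name "My Custom Heading 3", the STYLE_LEVELS mapping and both function names are placeholders, and starting skipped levels at 1 is my own choice, not necessarily what Word does.

```python
from string import ascii_lowercase

# Hypothetical mapping from style names (built-in or custom) to outline levels
STYLE_LEVELS = {"Heading 1": 1, "Heading 2": 2, "My Custom Heading 3": 3}

def format_number(counters):
    """Format counters like the question: '1.', '1.1.', then '(a)'."""
    if len(counters) >= 3:
        # Third level and deeper: lowercase letter in parentheses (wraps after z)
        return "(" + ascii_lowercase[(counters[-1] - 1) % 26] + ")"
    return ".".join(str(c) for c in counters) + "."

def number_headings(paragraphs, style_levels=STYLE_LEVELS):
    """paragraphs: iterable of (style_name, text) pairs; returns a list of lines."""
    counters = []  # counters[i] is the current number at level i + 1
    lines = []
    for style_name, text in paragraphs:
        level = style_levels.get(style_name)
        if level is None:
            lines.append(text)  # not a heading: keep the text unchanged
            continue
        if len(counters) >= level:
            # Same level or shallower: drop deeper counters, then increment
            counters = counters[:level]
            counters[-1] += 1
        else:
            # Deeper: every newly opened level starts at 1 (an assumption)
            counters = counters + [1] * (level - len(counters))
        lines.append(format_number(counters) + " " + text)
    return lines

sample = [
    ("Heading 1", "Heading"),
    ("Heading 2", "Sub-heading"),
    ("My Custom Heading 3", "Sub-sub-heading"),
]
print("\n".join(number_headings(sample)))
```

With python-docx, the pairs could come from something like ((p.style.name, p.text) for p in doc.paragraphs), with STYLE_LEVELS filled in by hand for whatever custom style names the document uses.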