How to extract the text under headings from a docx file using Python

Question (votes: 0, answers: 1)

I want to extract the text that sits under each heading in a docx file. The text is structured roughly like this:

1. DESCRIPTION
  Some text here

2. TERMS AND SERVICES
 2.1 Some text here
 2.2 Some text here

3. PAYMENTS AND FEES
  Some text here

The output I'm looking for is something like this:

['1. DESCRIPTION','Some text here']
['2. TERMS AND SERVICES','2.1 Some text here 2.2 Some text here']
['3. PAYMENTS AND FEES', 'Some text here']

I tried using the python-docx library:

from docx import Document

document = Document('Test.docx')

def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph

for heading in iter_headings(document.paragraphs):
    print(heading.text)

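The document mixes several styles: Normal, Body Text, and Heading #. Sometimes a heading is styled "Normal" while the text of that section is styled "Body Text". Can someone point me in the right direction? Any help would be much appreciated.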

python-3.x python-docx
1 answer
0 votes

There is a way to do this.

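After extracting the content, also treat paragraphs that have the "Normal" style and are bold as headings. Apply this logic carefully, though, so that it does not catch bold characters inside an ordinary paragraph (for example, a bold word that only highlights an important term in that paragraph).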

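You can do it like this: go through each paragraph, then iterate over all of its runs and check whether every run in the paragraph is bold. If all the runs in a given "Normal" paragraph have the bold attribute, you can conclude that the paragraph is a heading.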

To apply this logic, you can use the following code while iterating over the document's paragraphs:


from docx import Document

document = Document('Test.docx')

# Iterate over the paragraphs
for paragraph in document.paragraphs:
    # Start off with an empty string that collects the bold text of this paragraph
    runboldtext = ''
    # Iterate over the runs (paragraph.runs; paragraph.text is a plain string and has no runs)
    for run in paragraph.runs:
        if run.bold:
            runboldtext = runboldtext + run.text
    # If the bold text equals the entire paragraph text, every word is bold,
    # so the paragraph can be considered a heading
    if runboldtext == str(paragraph.text) and runboldtext != '':
        print("Heading True for the paragraph: ", runboldtext)
        style_of_current_paragraph = 'Heading'
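
To get from that heading test to the output the question asks for, the paragraphs still have to be grouped under their headings. Below is a minimal sketch of one way to do that. It assumes a paragraph counts as a heading when its style name starts with "Heading" or when all of its non-blank runs are explicitly bold. The names `is_heading`, `group_by_heading`, `load_sections` and the `fake` stand-in are hypothetical and only for illustration; they are not part of python-docx.

```python
from types import SimpleNamespace  # only needed for the stand-in demo below


def is_heading(paragraph):
    """Assumption: a heading has a 'Heading...' style or consists only of bold runs."""
    if paragraph.style.name.startswith("Heading"):
        return True
    # Ignore whitespace-only runs so a trailing non-bold space does not matter
    runs = [run for run in paragraph.runs if run.text.strip()]
    return bool(runs) and all(run.bold for run in runs)


def group_by_heading(paragraphs):
    """Return [heading_text, body_text] pairs in document order."""
    sections = []
    for paragraph in paragraphs:
        text = paragraph.text.strip()
        if not text:
            continue  # skip empty paragraphs
        if is_heading(paragraph):
            sections.append([text, []])
        elif sections:
            sections[-1][1].append(text)
        # Text that appears before the first heading is dropped in this sketch
    return [[heading, " ".join(body)] for heading, body in sections]


def load_sections(path):
    """Hypothetical wrapper for a real file; requires python-docx."""
    from docx import Document
    return group_by_heading(Document(path).paragraphs)


# Stand-in paragraphs that mimic the attributes used above (text, style.name, runs)
def fake(text, style="Normal", bold=False):
    run = SimpleNamespace(text=text, bold=bold)
    return SimpleNamespace(text=text, style=SimpleNamespace(name=style), runs=[run])


sample = [
    fake("1. DESCRIPTION", bold=True),
    fake("Some text here", style="Body Text"),
    fake("2. TERMS AND SERVICES", style="Heading 1"),
    fake("2.1 Some text here"),
    fake("2.2 Some text here"),
    fake("3. PAYMENTS AND FEES", bold=True),
    fake("Some text here", style="Body Text"),
]
result = group_by_heading(sample)
for section in result:
    print(section)
```

For a real file, `load_sections('Test.docx')` would return the same kind of list. Note that in python-docx `run.bold` is `None` when boldness is inherited from the style rather than set on the run, so a heading that is bold only through its style would not be caught by the bold check and would need an extra rule.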