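Some Word documents (.docx) contain soft returns (^l, or manual line breaks).
When I count the paragraphs in such a document with the script below, it reports only 1 paragraph.
How can I detect the soft returns (^l, or manual line breaks) and replace them with hard returns (^p, or paragraph marks), so that the script counts the actual number of paragraphs?
I have already tried the following, but it does not work: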
paragraph.text.replace('\r', '\n')
Thank you.
from docx import Document

def count_paragraphs(docx_path):
    doc = Document(docx_path)
    return len(doc.paragraphs)

# Example usage:
docx_path = 'the_file.docx'
paragraph_count = count_paragraphs(docx_path)
print(f'The number of paragraphs in {docx_path}: {paragraph_count}')
From the documentation, https://python-docx.readthedocs.io/en/latest/api/text.html#paragraph-objects:
text can contain tab (\t) characters, which are converted to the appropriate XML form for a tab. text can also include newline (\n) or carriage return (\r) characters, each of which is converted to a line break
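So I would look at the text attribute of each paragraph and search it for newline and carriage-return characters. With some testing, you should be able to take the paragraph count and add the number of newlines/carriage returns to it to get the correct total.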
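The counting idea above could be sketched roughly as follows. This is only a minimal sketch, not a tested solution for every document: it assumes that a manual line break shows up as '\n' when paragraph.text is read, and the function names count_lines_in_texts and count_paragraphs_with_breaks are made up for illustration.

```python
def count_lines_in_texts(texts):
    """Count each paragraph once, plus one for every line break inside it."""
    total = 0
    for text in texts:
        # Assumption: a soft return (manual line break) appears as '\n'
        # in the paragraph's text when the document is read.
        total += 1 + text.count('\n')
    return total


def count_paragraphs_with_breaks(docx_path):
    """Paragraph count for a .docx file, treating soft returns as paragraph ends."""
    # Imported here so the pure counting helper above works without python-docx.
    from docx import Document

    doc = Document(docx_path)
    return count_lines_in_texts(p.text for p in doc.paragraphs)
```

Calling count_paragraphs_with_breaks('the_file.docx') in place of count_paragraphs should then give the adjusted total. Note that empty paragraphs are still counted, just as len(doc.paragraphs) counts them, and that the file itself is not modified: the soft returns are only counted, not replaced with paragraph marks.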