How do I get the word count of a Word document in Python?

Question

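I am trying to get the word count of .doc, .docx, .odt and .pdf files. This is very simple for .txt files, but how do I do a word count for the types above?

I am using Python Django on Ubuntu, and I want to count the words in a document when a user uploads a file through the system.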

Answer 1

First, you need to read your .doc, .docx, .odt and .pdf files.

Second, count the words (<2.7 version).
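The counting step can be sketched as below, assuming the text has already been extracted into a string and that "word" means a whitespace-separated token. The name count_words is a hypothetical helper, not part of any library, and this definition will not suit languages written without spaces.

```python
def count_words(text):
    """Count whitespace-separated tokens in an already-extracted string."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and drops empty strings.
    return len(text.split())


if __name__ == "__main__":
    sample = "This is  a short\nsample text."
    print(count_words(sample))
```

Word processors apply their own rules for hyphens, numbers and punctuation, so a count produced this way may differ slightly from the one shown in the application.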

Answer 2

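These answers miss a trick regarding MS Word and .odt.

Whenever a .docx file is saved, MS Word records the file's word count. A .docx file is just a zip file. Accessing the "Words" (word count) property inside it is easy and can be done with modules from the standard library: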

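import zipfile
import xml.etree.ElementTree as ET

# Placeholder list; replace with the paths of the .docx files to count.
docx_file_paths = ['/path/to/file/mydocument.docx']

total_word_count = 0
for docx_file_path in docx_file_paths:
    with zipfile.ZipFile(docx_file_path) as zin:
        for item in zin.infolist():
            # docProps/app.xml holds the extended properties, including <Words>.
            if item.filename == 'docProps/app.xml':
                buffer = zin.read(item.filename)
                root = ET.fromstring(buffer.decode('utf-8'))
                for child in root:
                    # Tags are namespace-qualified, so match on the local name.
                    if child.tag.endswith('Words'):
                        print(f'{docx_file_path} word count {child.text}')
                        total_word_count += int(child.text)

print(f'total word count all files {total_word_count}')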

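Pros and cons: the main advantage is that, for most files, this will be far faster than anything else.

The main disadvantage is that you are stuck with the various idiosyncrasies of MS Word's counting method: I am not particularly interested in the details, but I know they have changed between versions (for example, words in text boxes may or may not be included). However, the same complications arise if you choose to take apart and parse the entire text content of a .docx file. The various available modules, such as python-docx, seem to do a fairly good job, but in my experience none is perfect.

If you actually extract and parse the main XML part of a .docx file (word/document.xml) yourself, you begin to realise how daunting the complexities involved are.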

.odt files
Again, these are zip files, and a similar property is found in meta.xml. I just created and unzipped one such file, and its meta.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<office:document-meta xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:grddl="http://www.w3.org/2003/g/data-view#" office:version="1.3">
    <office:meta>
        <meta:creation-date>2023-06-11T18:25:09.898000000</meta:creation-date>
        <dc:date>2023-06-11T18:25:21.656000000</dc:date>
        <meta:editing-duration>PT11S</meta:editing-duration>
        <meta:editing-cycles>1</meta:editing-cycles>
        <meta:document-statistic meta:table-count="0" meta:image-count="0" meta:object-count="0" meta:page-count="1" meta:paragraph-count="1" meta:word-count="2" meta:character-count="12" meta:non-whitespace-character-count="11"/>
        <meta:generator>LibreOffice/7.4.6.2$Windows_X86_64 LibreOffice_project/5b1f5509c2decdade7fda905e3e1429a67acd63d</meta:generator>
    </office:meta>
</office:document-meta>

So you need to look at root['office:meta']['meta:document-statistic'], attribute meta:word-count.
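The lookup described above can be sketched with the standard library as follows. The function name odt_word_count is hypothetical; the namespace URI is the meta namespace from the meta.xml shown above, and ElementTree needs it spelled out in {uri}name form because it does not accept the meta: prefix directly.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI bound to the "meta" prefix in the meta.xml shown above.
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def odt_word_count(odt_file):
    """Return the word count stored in an .odt file, or None if it is absent.

    odt_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(odt_file) as zin:
        root = ET.fromstring(zin.read("meta.xml"))
    # Find <meta:document-statistic> anywhere below the root element.
    stat = root.find(f".//{{{META_NS}}}document-statistic")
    if stat is None:
        return None
    value = stat.get(f"{{{META_NS}}}word-count")
    return int(value) if value is not None else None
```

Because zipfile.ZipFile accepts file-like objects, an uploaded file could in principle be passed in directly instead of being saved to disk first. The stored count reflects the document as it was last saved by the application that wrote it.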

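I don't know about PDFs: they quite possibly require brute-force counting. PyPDF2 looks workable: the simplest approach would be to convert to txt and count that way. I don't know what might be missed.
For example, a scanned PDF may be hundreds of pages long and yet be reported as containing "0 words". Or there may indeed be scanned text interspersed with genuine text content...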
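A minimal sketch of the brute-force approach, assuming the third-party pypdf package (the successor to PyPDF2) is installed. The helper names count_words and pdf_word_count are hypothetical, and the result depends entirely on how well text extraction works for the given file.

```python
def count_words(text):
    """Count whitespace-separated tokens in a string."""
    return len(text.split())


def pdf_word_count(pdf_path):
    """Extract text page by page with pypdf and count the words."""
    # Imported here so that count_words stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Treat a page with no extractable text as an empty string.
    return sum(count_words(page.extract_text() or "") for page in reader.pages)
```

Pages that contain only scanned images have no text layer, so they contribute nothing to the total; counting those would need OCR, which this sketch does not attempt.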

Answer 3

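Given that you can do this for .txt files, I assume you know how to count words and only need to know how to read the various file types. Take a look at these libraries:

PDF: pypdf

doc/docx: this question, python-docx

odt: example here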

Answer 4

As @Chad's answer to "extracting text from MS word files in python" points out:

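import re
import zipfile

# Read the main document part from the .docx (zip) archive.
with zipfile.ZipFile('/path/to/file/mydocument.docx') as docx:
    content = docx.read('word/document.xml').decode('utf-8')

# Add a space at each paragraph end so adjacent paragraphs do not fuse.
content = content.replace('</w:p>', ' </w:p>')
# Strip all XML tags, leaving only the text (an approximation).
cleaned = re.sub(r'<(.|\n)*?>', '', content)

# Count whitespace-separated words; len(cleaned) would count characters.
word_count = len(cleaned.split())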
