Detecting the language/script of a PDF with Python

Question

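I am trying to write a Python script that uses pytesseract to detect the language/script of a PDF that has not been OCRed yet, and then runs the "real" OCR with the detected language passed in.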

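I have about 10,000 PDFs. They are not always in standard English, and some are 1,000 pages long. To do the real OCR, I first need to detect the language automatically.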

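So it is a two-step OCR, if you will, and Tesseract can perform both steps:

1. Detect the language/script on a few pages in the middle of the document
2. Run the real OCR on all pages with the language/script that was found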

Any tips for fixing or improving this script? All I want is to get back the language detected on the given pages.

#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
    print(pytesseract.image_to_osd(im, lang='osd',config='psm=0 pandas_config=None', nice  =0, timeout=0))

I run the script as follows:

script_detect.py myunknown.pdf

At the moment I get the following error:

TypeError: Unsupported image object

Answer 1

Assuming you have already converted the PDF to text with some tool (OCR or otherwise), you can use langdetect. Sample your text and pass it to detect:

from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)

Output: fr

Or:

from langdetect import detect
lang = detect("我是法国人")
print(lang)

Output: zh-cn (langdetect reports Chinese as zh-cn or zh-tw; there is no ch code)
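
For long documents, running detection on a few evenly spaced samples is usually enough. A minimal sketch, assuming the extracted text is already in a string: sample_text and detect_language are hypothetical helper names, and the chunk counts and sizes are arbitrary defaults. Setting DetectorFactory.seed makes langdetect's otherwise non-deterministic results repeatable.

```python
def sample_text(text, n_chunks=5, chunk_size=500):
    """Return up to n_chunks evenly spaced slices of text, joined by newlines."""
    if len(text) <= n_chunks * chunk_size:
        return text  # short enough to use as-is
    if n_chunks < 2:
        return text[:chunk_size]
    # Spread the chunk start offsets from the beginning to the end of the text.
    step = (len(text) - chunk_size) // (n_chunks - 1)
    return "\n".join(
        text[i * step : i * step + chunk_size] for i in range(n_chunks)
    )


def detect_language(text):
    """Return langdetect's best guess for a sample of text."""
    # Imported here so sample_text stays usable without langdetect installed.
    from langdetect import DetectorFactory, detect

    DetectorFactory.seed = 0  # fixed seed for repeatable results
    return detect(sample_text(text))
```

Calling detect_language on the full text of a document returns a single language code; very short or empty input makes langdetect raise an exception, which the sketch does not handle.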

There are other libraries, such as polyglot, that are useful if your text mixes several languages.
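
As for the TypeError in the question: pytesseract accepts PIL images, NumPy arrays, or file paths, but not wand.image.Image objects, which is the likely cause. Below is a minimal sketch of the detection step, assuming PyMuPDF (fitz) renders the pages and Pillow wraps them. The helper names sample_pages, parse_script and detect_script are hypothetical, and page_count and get_pixmap(dpi=...) assume a recent PyMuPDF release.

```python
import re
from collections import Counter


def sample_pages(page_count, surround=2):
    """Return 0-based indices of the pages around the middle of the document."""
    if page_count <= 0:
        return []
    center = page_count // 2
    first = max(0, center - surround)
    last = min(page_count - 1, center + surround)
    return list(range(first, last + 1))


def parse_script(osd_text):
    """Extract the script name from Tesseract's OSD report, or None if absent."""
    match = re.search(r"^Script:[ \t]*(\S.*?)[ \t]*$", osd_text, flags=re.MULTILINE)
    return match.group(1) if match else None


def detect_script(pdf_path, surround=2, dpi=300):
    """Run OSD on a few middle pages and return the most common script."""
    # Third-party imports are kept local so the helpers above work without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    votes = Counter()
    with fitz.open(pdf_path) as doc:
        for index in sample_pages(doc.page_count, surround):
            # Render the page to a bitmap and wrap it in a PIL image.
            pix = doc[index].get_pixmap(dpi=dpi)
            image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            try:
                script = parse_script(pytesseract.image_to_osd(image))
            except pytesseract.TesseractError:
                continue  # OSD can fail on pages with too little text
            if script:
                votes[script] += 1
    return votes.most_common(1)[0][0] if votes else None
```

Calling detect_script("myunknown.pdf") would return a script name such as Latin, or None if no page yields a result. The sketch leaves out the second step: mapping that script to a lang value for the full OCR pass (for example a script/Latin traineddata file, if one is installed).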
