I have been using Tesseract to read various numbers (up to 99,999.9) in the following format:
It reads them correctly about 80% of the time, but I need 95% accuracy.
async function runOCR(url) {
  const worker = await Tesseract.createWorker('eng', 1, {
    tessedit_pageseg_mode: 13,
    config: '--psm 13'
  });
  (async () => {
    await worker.load();
    await worker.loadLanguage('eng');
    await worker.initialize('eng');
    await worker.setParameters({
      tessedit_ocr_engine_mode: Tesseract.OEM_TESSERACT_ONLY,
      tessedit_char_whitelist: '0123456789,.',
      preserve_interword_spaces: '0',
      SINGLE_WORD: true,
      tessedit_pageseg_mode: Tesseract.SINGLE_WORD,
    });
    const {
      data: { text },
    } = await worker.recognize(url);
    doSomething(text);
    await worker.terminate();
  })();
}
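The main problem is that I don't know where to set the page segmentation mode (PSM, pageseg). The examples I have found are either outdated or written in other languages.
Here is the list of pageseg options I found in the Tesseract source (https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163):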
PSM_OSD_ONLY, ///< Orientation and script detection only.
PSM_AUTO_OSD, ///< Automatic page segmentation with orientation and
///< script detection. (OSD)
PSM_AUTO_ONLY, ///< Automatic page segmentation, but no OSD, or OCR.
PSM_AUTO, ///< Fully automatic page segmentation, but no OSD.
PSM_SINGLE_COLUMN, ///< Assume a single column of text of variable sizes.
PSM_SINGLE_BLOCK_VERT_TEXT, ///< Assume a single uniform block of vertically
///< aligned text.
PSM_SINGLE_BLOCK, ///< Assume a single uniform block of text. (Default.)
PSM_SINGLE_LINE, ///< Treat the image as a single text line.
PSM_SINGLE_WORD, ///< Treat the image as a single word.
PSM_CIRCLE_WORD, ///< Treat the image as a single word in a circle.
PSM_SINGLE_CHAR, ///< Treat the image as a single character.
PSM_SPARSE_TEXT, ///< Find as much text as possible in no particular order.
PSM_SPARSE_TEXT_OSD, ///< Sparse text with orientation and script det.
PSM_RAW_LINE, ///< Treat the image as a single text line, bypassing
///< hacks that are Tesseract-specific.
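How can I detect the numbers in the image more reliably, or how do I set the page segmentation mode/config correctly? (The config changes I have made do not seem to affect my hit rate.)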
I see tessedit_pageseg_mode: 13 in createWorker, and then tessedit_pageseg_mode: Tesseract.SINGLE_WORD in worker.setParameters. Set this parameter (the page segmentation mode) only once, before calling the recognize function.
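The two settings in the question also appear to name different modes. Assuming the enum quoted in the question starts at 0 and counts up in declaration order (the C++ default when no explicit values are given), 13 corresponds to PSM_RAW_LINE, while PSM_SINGLE_WORD is 8. A minimal sketch that derives the numbers from that list; psmValue and psmName are hypothetical helpers written for this illustration, not part of tesseract.js:

```javascript
// Names copied from the enum in publictypes.h, in declaration order.
// Assumption: the first entry has value 0 and each following entry is +1.
const PSM_NAMES = [
  'PSM_OSD_ONLY',
  'PSM_AUTO_OSD',
  'PSM_AUTO_ONLY',
  'PSM_AUTO',
  'PSM_SINGLE_COLUMN',
  'PSM_SINGLE_BLOCK_VERT_TEXT',
  'PSM_SINGLE_BLOCK',
  'PSM_SINGLE_LINE',
  'PSM_SINGLE_WORD',
  'PSM_CIRCLE_WORD',
  'PSM_SINGLE_CHAR',
  'PSM_SPARSE_TEXT',
  'PSM_SPARSE_TEXT_OSD',
  'PSM_RAW_LINE',
];

// Hypothetical helper: numeric value for an enum name.
function psmValue(name) {
  const index = PSM_NAMES.indexOf(name);
  if (index === -1) {
    throw new RangeError(`Unknown page segmentation mode: ${name}`);
  }
  return index;
}

// Hypothetical helper: enum name for a numeric value such as the one in '--psm 13'.
function psmName(value) {
  const name = PSM_NAMES[value];
  if (name === undefined) {
    throw new RangeError(`No page segmentation mode with value ${value}`);
  }
  return name;
}

console.log(psmName(13));                 // PSM_RAW_LINE
console.log(psmValue('PSM_SINGLE_WORD')); // 8
console.log(psmValue('PSM_SINGLE_LINE')); // 7
```

So a configuration that passes 13 in one place and a single-word mode in another is asking for two different segmentation strategies, which is one more reason to set the mode in a single place.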
To detect a single number in an image (like the one you provided), you should use PSM_SINGLE_LINE or PSM_SINGLE_WORD, which appear to be best suited to this kind of task.
async function runOCR(url) {
  // Note: the load/loadLanguage/initialize sequence below matches older
  // tesseract.js releases; from v5 on, createWorker performs these steps itself.
  const worker = await Tesseract.createWorker({
    logger: m => console.log(m)
  });
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  // Set only the necessary parameters, once
  await worker.setParameters({
    // tesseract.js exposes the constants as Tesseract.OEM.* and Tesseract.PSM.*
    tessedit_ocr_engine_mode: Tesseract.OEM.TESSERACT_ONLY,
    tessedit_char_whitelist: '0123456789.,',
    tessedit_pageseg_mode: Tesseract.PSM.SINGLE_LINE // or PSM.SINGLE_WORD if a line does not work well
  });
  // Now recognize the number in the image
  const { data: { text } } = await worker.recognize(url);
  doSomething(text);
  await worker.terminate();
}
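Since the expected values have a known shape (up to 99,999.9), it may also help to check the recognized text before passing it on, so that obvious misreads can be rejected or retried instead of being counted as results. The following is only a sketch under assumptions: parseReading and READING_PATTERN are hypothetical names, and the pattern assumes an optional thousands comma and at most one decimal digit, which may not match your images exactly.

```javascript
// Assumed format: 0 to 99,999.9, with an optional thousands comma and
// an optional single decimal digit. Adjust to the real layout of your numbers.
const READING_PATTERN = /^(?:\d{1,2},\d{3}|\d{1,5})(?:\.\d)?$/;

// Hypothetical helper: returns the parsed number, or null when the OCR text
// does not look like a valid reading.
function parseReading(rawText) {
  const cleaned = rawText.trim(); // OCR output often ends with a newline
  if (!READING_PATTERN.test(cleaned)) {
    return null;
  }
  return Number(cleaned.replace(/,/g, ''));
}

console.log(parseReading('99,999.9\n')); // 99999.9
console.log(parseReading('12.3'));       // 12.3
console.log(parseReading('1,23.4'));     // null (comma in the wrong place)
```

In runOCR, this could replace the direct doSomething(text) call: only forward the value when parseReading(text) is not null, and log or re-run the image otherwise.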