Tesseract training - Error reading radical code table data/langdata/radical-stroke.txt


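I am trying to train Tesseract OCR on a specific font, starting from the Polish language model (pol) and my own "ground truth" text. It may be relevant that my generated text does not contain every character of the Polish character set, because my OCR application does not use all of them.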

Tesseract 5.3.2, built on Ubuntu 22.04.

This is the command that starts the training:

TESSDATA_PREFIX=/home/xxx/tesseract/tessdata make training MODEL_NAME=POLcalibri START_MODEL=pol TESSDATA=/home/xxx/tesseract/tessdata MAX_ITERATIONS=1000

Training proceeds and ends with the following output:

python3 shuffle.py 0 "data/POLcalibri/all-lstmf"
+ head -n 134999 data/POLcalibri/all-lstmf
+ tail -n 15000 data/POLcalibri/all-lstmf
+ '[' '' = Windows_NT ']'
if [ "" = "Windows_NT" ]; then \
    dos2unix "data/POLcalibri/POLcalibri.numbers"; \
    dos2unix "data/POLcalibri/POLcalibri.punc"; \
    dos2unix "data/POLcalibri/POLcalibri.wordlist"; \
    dos2unix "data/langdata/POLcalibri/POLcalibri.config"; \
fi
combine_lang_model \
  --input_unicharset data/POLcalibri/unicharset \
  --script_dir data/langdata \
  --numbers data/POLcalibri/POLcalibri.numbers \
  --puncs data/POLcalibri/POLcalibri.punc \
  --words data/POLcalibri/POLcalibri.wordlist \
  --output_dir data \
   \
  --lang POLcalibri
Failed to read data from: data/POLcalibri/POLcalibri.wordlist
Failed to read data from: data/POLcalibri/POLcalibri.punc
Failed to read data from: data/POLcalibri/POLcalibri.numbers
Loaded unicharset of size 121 from file data/POLcalibri/unicharset
Setting unichar properties
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = P
Warning: properties incomplete for index 4 = O
Warning: properties incomplete for index 5 = T
Warning: properties incomplete for index 6 = R
Warning: properties incomplete for index 7 = Z
Warning: properties incomplete for index 8 = E
Warning: properties incomplete for index 9 = B
Warning: properties incomplete for index 10 = N
Warning: properties incomplete for index 11 = )
Warning: properties incomplete for index 12 = G
Warning: properties incomplete for index 13 = U
Warning: properties incomplete for index 14 = J
Warning: properties incomplete for index 15 = !
Warning: properties incomplete for index 16 = ,
Warning: properties incomplete for index 17 = W
Warning: properties incomplete for index 18 = C
Warning: properties incomplete for index 19 = Ł
Warning: properties incomplete for index 20 = A
Warning: properties incomplete for index 21 = S
Warning: properties incomplete for index 22 = K
Warning: properties incomplete for index 23 = I
Warning: properties incomplete for index 24 = '
Warning: properties incomplete for index 25 = M
Warning: properties incomplete for index 26 = L
Warning: properties incomplete for index 27 = D
Warning: properties incomplete for index 28 = .
Warning: properties incomplete for index 29 = Ę
Warning: properties incomplete for index 30 = H
Warning: properties incomplete for index 31 = ?
Warning: properties incomplete for index 32 = Y
Warning: properties incomplete for index 33 = "
Warning: properties incomplete for index 34 = Ż
Warning: properties incomplete for index 35 = :
Warning: properties incomplete for index 36 = V
Warning: properties incomplete for index 37 = 6
Warning: properties incomplete for index 38 = 0
Warning: properties incomplete for index 39 = 8
Warning: properties incomplete for index 40 = F
Warning: properties incomplete for index 41 = Ą
Warning: properties incomplete for index 42 = Ć
Warning: properties incomplete for index 43 = Ś
Warning: properties incomplete for index 44 = /
Warning: properties incomplete for index 45 = Ó
Warning: properties incomplete for index 46 = _
Warning: properties incomplete for index 47 = (
Warning: properties incomplete for index 48 = Ń
Warning: properties incomplete for index 49 = ;
Warning: properties incomplete for index 50 = -
Warning: properties incomplete for index 51 = Q
Warning: properties incomplete for index 52 = X
Warning: properties incomplete for index 53 = |
Warning: properties incomplete for index 54 = „
Warning: properties incomplete for index 55 = 2
Warning: properties incomplete for index 56 = 3
Warning: properties incomplete for index 57 = 1
Warning: properties incomplete for index 58 = 7
Warning: properties incomplete for index 59 = 9
Warning: properties incomplete for index 60 = ”
Warning: properties incomplete for index 61 = +
Warning: properties incomplete for index 62 = ]
Warning: properties incomplete for index 63 = [
Warning: properties incomplete for index 64 = 4
Warning: properties incomplete for index 65 = 5
Warning: properties incomplete for index 66 = =
Warning: properties incomplete for index 67 = Ź
Warning: properties incomplete for index 68 = »
Warning: properties incomplete for index 69 = <
Warning: properties incomplete for index 70 = >
Warning: properties incomplete for index 71 = *
Warning: properties incomplete for index 72 = $
Warning: properties incomplete for index 73 = «
Warning: properties incomplete for index 74 = %
Warning: properties incomplete for index 75 = ©
Warning: properties incomplete for index 76 = €
Warning: properties incomplete for index 77 = —
Warning: properties incomplete for index 78 = £
Warning: properties incomplete for index 79 = l
Warning: properties incomplete for index 80 = o
Warning: properties incomplete for index 81 = r
Warning: properties incomplete for index 82 = e
Warning: properties incomplete for index 83 = n
Warning: properties incomplete for index 84 = t
Warning: properties incomplete for index 85 = y
Warning: properties incomplete for index 86 = ń
Warning: properties incomplete for index 87 = c
Warning: properties incomplete for index 88 = z
Warning: properties incomplete for index 89 = k
Warning: properties incomplete for index 90 = m
Warning: properties incomplete for index 91 = b
Warning: properties incomplete for index 92 = s
Warning: properties incomplete for index 93 = a
Warning: properties incomplete for index 94 = j
Warning: properties incomplete for index 95 = d
Warning: properties incomplete for index 96 = g
Warning: properties incomplete for index 97 = ł
Warning: properties incomplete for index 98 = ę
Warning: properties incomplete for index 99 = p
Warning: properties incomplete for index 100 = w
Warning: properties incomplete for index 101 = i
Warning: properties incomplete for index 102 = v
Warning: properties incomplete for index 103 = u
Warning: properties incomplete for index 104 = f
Warning: properties incomplete for index 105 = h
Warning: properties incomplete for index 106 = ó
Warning: properties incomplete for index 107 = x
Warning: properties incomplete for index 108 = ą
Warning: properties incomplete for index 109 = ż
Warning: properties incomplete for index 110 = ś
Warning: properties incomplete for index 111 = q
Warning: properties incomplete for index 112 = ć
Warning: properties incomplete for index 113 = ź
Warning: properties incomplete for index 114 = á
Warning: properties incomplete for index 115 = Ü
Warning: properties incomplete for index 116 = ü
Warning: properties incomplete for index 117 = ’
Warning: properties incomplete for index 118 = Ű
Warning: properties incomplete for index 119 = ű
Warning: properties incomplete for index 120 = Á
Config file is optional, continuing...
Failed to read data from: data/langdata/POLcalibri/POLcalibri.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:309: data/POLcalibri/POLcalibri.traineddata] Error 1

I don't know how to fix this. A similar issue was raised on GitHub, but it has no solution.

lstm ocr tesseract leptonica tesseract-5.x
2 Answers

How about downloading radical-stroke.txt into data/langdata/?
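
A minimal sketch of that suggestion, assuming the missing files can be taken from the root of the tesseract-ocr/langdata_lstm repository. The log in the question also fails on data/langdata/Latin.unicharset, so the sketch checks for both files. The function names and the LANGDATA_DIR and LANGDATA_URL variables are illustrative and not part of tesstrain. tesstrain's own Makefile appears to offer a `tesseract-langdata` target for the same purpose, so check `make help` in your checkout first.

```shell
#!/bin/sh
# Sketch: report and fetch the langdata files that combine_lang_model failed to read.
# LANGDATA_DIR and LANGDATA_URL are assumptions; adjust them to your setup.
LANGDATA_DIR="${LANGDATA_DIR:-data/langdata}"
LANGDATA_URL="${LANGDATA_URL:-https://raw.githubusercontent.com/tesseract-ocr/langdata_lstm/main}"

# Print the name of each required file that is absent from the directory in $1.
missing_langdata() {
    for f in radical-stroke.txt Latin.unicharset; do
        [ -f "$1/$f" ] || echo "$f"
    done
}

# Download every missing file into the directory in $1 (needs network access).
fetch_langdata() {
    mkdir -p "$1"
    for f in $(missing_langdata "$1"); do
        wget -q -P "$1" "$LANGDATA_URL/$f"
    done
}

# Typical use from the tesstrain directory:
#   fetch_langdata "$LANGDATA_DIR"
```

After the files are in place, rerunning the `make training` command from the question should get past the "Error reading radical code table" step, provided nothing else is missing.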

By the way: try reading the instructions before posting on SO.



I ran into the same error, also with Polish.

In my case, the error was caused by using the .traineddata file from the "default" tessdata repository instead of the one from the tessdata_best repository.

From the Tesseract documentation:

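tessdata_best is for people willing to trade a lot of speed for slightly better accuracy. It is also the only set of files that can be used as the start_model for certain retraining scenarios for advanced users.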
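
One way to act on this is to fetch pol.traineddata from tessdata_best into a directory of its own and point TESSDATA at it. The sketch below assumes the model is served from the main branch of the tesseract-ocr/tessdata_best repository. The function names and the TESSDATA_BEST_URL variable are illustrative and not part of Tesseract or tesstrain.

```shell
#!/bin/sh
# Sketch: fetch a start model from tessdata_best into a chosen directory.
# TESSDATA_BEST_URL is an assumption; adjust it if the repository layout differs.
TESSDATA_BEST_URL="${TESSDATA_BEST_URL:-https://github.com/tesseract-ocr/tessdata_best/raw/main}"

# Print the download URL for the language code in $1.
best_model_url() {
    echo "$TESSDATA_BEST_URL/$1.traineddata"
}

# Download $1.traineddata into directory $2, replacing any existing copy
# only after the download has succeeded (needs network access).
fetch_best_model() {
    mkdir -p "$2"
    wget -q -O "$2/$1.traineddata.tmp" "$(best_model_url "$1")" &&
        mv "$2/$1.traineddata.tmp" "$2/$1.traineddata"
}

# Typical use, with a hypothetical directory kept separate from other models:
#   fetch_best_model pol "$HOME/tessdata_best"
```

Keeping the tessdata_best files in a separate directory avoids mixing them with models from the default repository. The `TESSDATA=` and `TESSDATA_PREFIX=` values in the `make training` command would then need to point at that directory.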
