Tesseract 5 OCR training - Segmentation fault error


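I'm trying to train tesseract 5 with a new font. I'm running tesseract on WSL Ubuntu, and I have followed Gabriel Garcia's tutorial and the official tesseract compilation docs. I'm trying to train tesseract on top of the eng.traineddata file from tessdata_best, which is placed in the tesseract/tessdata directory. I have also supplied the training data (tif, box, gt files) in the tesstrain/data/$(MODEL_NAME)-ground-truth directory.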
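Since the ground-truth directory name depends on MODEL_NAME, a quick check that the directory actually pairs every image with a transcription can rule out one class of problems. The following is a minimal sketch, assuming tesstrain's documented naming in which `foo.tif` is paired with `foo.gt.txt`; the helper name `check_ground_truth` is hypothetical and not part of tesstrain.

```shell
#!/usr/bin/env bash
# Sketch only: list line images that lack a matching transcription.
# Assumes the tesstrain naming convention: foo.tif pairs with foo.gt.txt.

# Hypothetical helper. Prints every .tif without a non-empty .gt.txt
# and returns 1 if at least one such image was found.
check_ground_truth() {
  local dir="$1" missing=0 count=0 img base
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue        # the glob matched nothing
    count=$((count + 1))
    base="${img%.tif}"
    if [ ! -s "$base.gt.txt" ]; then
      echo "no transcription for: $img"
      missing=1
    fi
  done
  echo "checked $count image(s) in $dir" >&2
  return "$missing"
}
```

Run from the tesstrain directory, a call such as `check_ground_truth data/Calibri-ground-truth` would report a count of 0 if the directory name does not match the MODEL_NAME passed to make.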

When I run the training command

TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=eng START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100

I get the following output:

user@DESKTOP:~/tesseract-ocr/tesstrain$ TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Calibri START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Calibri
Extracting tessdata components from ../tesseract/tessdata/eng.traineddata
Wrote data/eng/Calibri.lstm
Wrote data/eng/Calibri.lstm-punc-dawg
Wrote data/eng/Calibri.lstm-word-dawg
Wrote data/eng/Calibri.lstm-number-dawg
Wrote data/eng/Calibri.lstm-unicharset
Wrote data/eng/Calibri.lstm-recoder
Wrote data/eng/Calibri.version
Version:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
unicharset_extractor --output_unicharset "data/Calibri/my.unicharset" --norm_mode 2 "data/Calibri/all-gt"
Extracting unicharset from plain text file data/Calibri/all-gt
Other case É of é is not in unicharset
Wrote unicharset file data/Calibri/my.unicharset
merge_unicharsets data/eng/Calibri.lstm-unicharset data/Calibri/my.unicharset "data/Calibri/unicharset"
Loaded unicharset of size 112 from file data/eng/Calibri.lstm-unicharset
Loaded unicharset of size 112 from file data/Calibri/my.unicharset
Wrote unicharset file data/Calibri/unicharset.
python3 shuffle.py 0 "data/Calibri/all-lstmf"
/bin/bash: line 2: bc: command not found
/bin/bash: line 5: bc: command not found
+ head -n '' data/Calibri/all-lstmf
head: invalid number of lines: ''
+ tail -n '' data/Calibri/all-lstmf
tail: invalid number of lines: ''
+ '[' '' = Windows_NT ']'
if [ "" = "Windows_NT" ]; then \
        dos2unix "data/Calibri/Calibri.numbers"; \
        dos2unix "data/Calibri/Calibri.punc"; \
        dos2unix "data/Calibri/Calibri.wordlist"; \
        dos2unix "data/langdata/Calibri/Calibri.config"; \
fi
combine_lang_model \
  --input_unicharset data/Calibri/unicharset \
  --script_dir data/langdata \
  --numbers data/Calibri/Calibri.numbers \
  --puncs data/Calibri/Calibri.punc \
  --words data/Calibri/Calibri.wordlist \
  --output_dir data \
   \
  --lang Calibri
Failed to read data from: data/Calibri/Calibri.wordlist
Failed to read data from: data/Calibri/Calibri.punc
Failed to read data from: data/Calibri/Calibri.numbers
Loaded unicharset of size 112 from file data/Calibri/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 47 = ~
Config file is optional, continuing...
Failed to read data from: data/langdata/Calibri/Calibri.config
Null char=2
Created data/Calibri/Calibri.traineddata
lstmtraining \
  --debug_interval 0 \
  --traineddata data/Calibri/Calibri.traineddata \
  --old_traineddata ../tesseract/tessdata/eng.traineddata \
  --continue_from data/eng/Calibri.lstm \
  --learning_rate 0.0001 \
  --model_output data/Calibri/checkpoints/Calibri \
  --train_listfile data/Calibri/list.train \
  --eval_listfile data/Calibri/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Failed to load list of training filenames from data/Calibri/list.train
make: *** [Makefile:324: data/Calibri/checkpoints/Calibri_checkpoint] Error 1
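Two lines in this output stand out: `bc: command not found` appears immediately before `head -n ''` and `tail -n ''` fail, which suggests the line counts for the train/eval split were never computed and list.train was left empty or missing. A minimal preflight sketch, run from the tesstrain directory before `make training`, could look like the following. The helper names `require_tool` and `require_nonempty_list` are hypothetical and not part of tesstrain.

```shell
#!/usr/bin/env bash
# Sketch only: preflight checks before `make training`.
# The helper names are hypothetical, not part of tesstrain.

# Succeeds only if the named command is on PATH.
require_tool() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "missing tool: $1" >&2
    return 1
  fi
}

# Succeeds only if the file exists and is not empty.
require_nonempty_list() {
  if [ ! -s "$1" ]; then
    echo "missing or empty list: $1" >&2
    return 1
  fi
}
```

For example, `require_tool bc && require_nonempty_list data/Calibri/all-lstmf` would fail on the setup shown in this log. On Ubuntu, bc is normally available as the `bc` package (`sudo apt install bc`).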

I tried manually adding the paths of the lstm files to the list.train file. The error

Failed to load list of training filenames from data/Calibri/list.train

went away, but when I run the train command again I now get this error:

user@DESKTOP:~/tesseract-ocr/tesstrain$ TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Calibri START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
lstmtraining \
  --debug_interval 0 \
  --traineddata data/Calibri/Calibri.traineddata \
  --old_traineddata ../tesseract/tessdata/eng.traineddata \
  --continue_from data/eng/Calibri.lstm \
  --learning_rate 0.0001 \
  --model_output data/Calibri/checkpoints/Calibri \
  --train_listfile data/Calibri/list.train \
  --eval_listfile data/Calibri/list.eval \
  --max_iterations 100 \
  --target_error_rate 0.01
Loaded file data/eng/Calibri.lstm, unpacking...
Warning: LSTMTrainer deserialized an LSTMRecognizer!
Code range changed from 111 to 111!
Num (Extended) outputs,weights in Series:
  1,36,0,1:1, 0
Num (Extended) outputs,weights in Series:
  C3,3:9, 0
  Ft16:16, 160
Total weights = 160
  [C3,3Ft16]:16, 160
  Mp3,3:16, 0
  TxyLfys64:64, 20736
  Lfx96:96, 61824
  RxLrx96:96, 74112
  Lfx512:512, 1247232
  Fc111:111, 56943
Total weights = 1461007
Previous null char=110 mapped to 110
Continuing from data/eng/Calibri.lstm
make: *** [Makefile:324: data/Calibri/checkpoints/Calibri_checkpoint] Segmentation fault
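Because list.train was edited by hand before this run, one thing that may be worth ruling out is that every entry resolves to a readable, non-empty file and that the list has no Windows-style line endings. This is a diagnostic sketch only, not a confirmed cause of the segmentation fault; the helper name `check_list_entries` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: validate the entries of a hand-written list file.
# Paths are resolved relative to the current directory, so run this
# from the same directory as `make training`.

# Hypothetical helper. Prints each problematic entry and returns 1
# if any entry is missing, empty, or ends in a carriage return.
check_list_entries() {
  local list="$1" bad=0 path cr
  cr=$(printf '\r')
  while IFS= read -r path || [ -n "$path" ]; do
    [ -n "$path" ] || continue       # skip blank lines
    case "$path" in
      *"$cr")
        echo "CRLF line ending: ${path%"$cr"}"
        bad=1
        continue
        ;;
    esac
    if [ ! -s "$path" ]; then
      echo "bad entry: $path"
      bad=1
    fi
  done < "$list"
  return "$bad"
}
```

A call such as `check_list_entries data/Calibri/list.train` prints nothing and returns 0 when every listed file exists and is non-empty. The same check can be applied to data/Calibri/list.eval, which the failed split would also have left unusable.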

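I have searched the internet, and the closest thing I found is this issue opened on the tesseract GitHub page. The problem reported there was resolved by switching the traineddata file from tessdata_fast to tessdata_best, but that did not work for me.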

Thanks in advance.

makefile ocr tesseract windows-subsystem-for-linux tesseract-5.x
1 Answer

I also followed his YouTube tutorial, and as soon as I run that command I get an error.
