How to (quickly) batch process multiple images and run them through tesseract


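I have successfully extracted text from a single PDF file using a combination of magick (in R) and tesseract, but I've hit a roadblock when trying to process multiple images. (This is for a non-profit organization.)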

I welcome answers that start in bash, but ask that they be comprehensive and not skip the tesseract component.

The answers to this question deal with image cleaning without OCR, so I'm not sure how to integrate the first answer here.

Image data: [image not included]

My process:

library(tesseract)
library(dplyr)
library(stringr)
library(pdftools)
library(readr)
library(magick)
library(purrr)
# original data
#pdf <- https://github.com/pembletonc/Project44_Text_Extraction/blob/master/test-data/001_0145.pdf

#image file (note that size here doesn't match processing below because of 2mb limit)

file_name <- tools::list_files_with_exts(dir = "./test-data", exts = "pdf")
page_count <- pdf_info(file_name)$pages  

multi_files <- list(pdftools::pdf_convert(file_name, pages = 1:page_count,
                                          filenames = paste0("./test-data/", "page", 1:page_count, ".png"),
                                          dpi = 250))

#or just get the file paths if the PNG files were already created
#multi_files <- list(tools::list_files_with_exts(dir = "./test-data", exts = "png"))

Read the images in as magick objects:

multi_images <- map(multi_files, image_read)

This creates a magick pointer object (printed as a tibble) with the images more or less joined together as frames:

[[1]]
# A tibble: 5 x 7
  format width height colorspace matte filesize density
  <chr>  <int>  <int> <chr>      <lgl>    <int> <chr>  
1 PNG     3243   2010 sRGB       FALSE        0 98x98  
2 PNG     3247   2013 sRGB       FALSE  4515441 98x98  
3 PNG     3243   2013 sRGB       FALSE  4559229 98x98  
4 PNG     3247   2010 sRGB       FALSE  4270145 98x98  
5 PNG     3247   2010 sRGB       FALSE  3212528 98x98  
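
For context on the printout above: `multi_files` is a list containing one character vector of paths, so `map()` calls `image_read()` once and the result is a single magick object with five frames. A minimal sketch of one way to split such an object into per-page images, assuming only base R and magick; `split_frames` is a hypothetical helper name, not part of either package:

```r
library(magick)

# Hypothetical helper: split a multi-frame magick object into a list of
# single-frame images so that each page can be handled on its own.
split_frames <- function(img) {
  # seq_along() counts frames; img[i] extracts the i-th frame
  lapply(seq_along(img), function(i) img[i])
}

# Usage with the object from the question (multi_images[[1]] holds all pages):
# pages <- split_frames(multi_images[[1]])
# length(pages)
```

Each element of the returned list is an ordinary magick image, so the usual `image_*()` functions can be applied to it directly.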

How do I access each PNG in it so that it can be cleaned up and run through OCR?

multi_text_clean <- function(images){

  Map(function(x) {
    x %>% 
      image_crop(geometry_area(width = 2200, height = 1600, y_off = 500, x_off = 650)) %>%  
      image_resize("2000x") %>%
      image_background("white", flatten = TRUE) %>% 
      image_noise(noisetype = "Uniform") %>%          # Adds uniform noise; image_reducenoise() is the noise-reducing filter
      image_enhance() %>%                             # Enhance image (minimize noise)
      image_normalize() %>% 
      image_convert(type = 'Grayscale') %>%
      image_trim(fuzz = 40) %>%
      image_contrast(sharpen = 1) %>%
      #image_deskew(threshold = 40) %>% 
      image_write(format = 'png', density = '300x300') %>%
      tesseract::ocr(tesseract(options = list(preserve_interword_spaces = 1)))
  }, images)

}

This only runs on the first image:

text_list <-  multi_text_clean(multi_images)
(text_multi <- stringr::str_split(text_list, pattern = "\\s{5,}"))

[[1]]
 [1] "Weather clear all day. A small arms inspection held at 1400 hrs. A recce party went\njout consisting of Coy Comds and Lt Col Nicklin, I.0. and Asst Adjt. An Orders group\nheld in the evening. Pay parade for HQ and Bn HQ was at 1900 hrs. A movie was shown\nfor B Coy personnel by our YMCA Supervisor."                                                                                                                                                                                                                                                                                                                                                                                                                                                                           
 [2] ")\nWeather clear and cold all day. Personnel packed equipment early in the morning and |~\nwere ready to move at 0830 hrs. Unit embussed at 0900 hrs and moved to Rochefort, MR\n2076, Sheet 105, 1/25000, arriving at 1390 hrs. Coys were in position at 1600 hrs. |,,\nPW brought in by A Coy at 1800 hrs. PW was a deserter from 304 Regt 2 Pz division.\nNo other activity during the day. Patrols were sent out during the night by all coys}) u\nCold all day. Very quiet all morning. A Coy moved forward. Coy HQ set up at Chateawv .\n\\Vieux de Rochefort. Slight opposition met by A Coy on advance. Opposition met at\n\\Croic St Jean. A Coy was in position at 1700 hrs. Advance started at 1500 hrs. OP\nset up at 1900 hrs at MR 207753. Patrols sent out by all Coys."
 [3] "“y\neather wet all day. Snowed most of the day. 1 Pl from C Coy guarding bridge MR\n204767. A Coy sent a fighting patrol to clear Powder Mill woods MR 2074. Recce\npatrols sent out byall coys."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
 [4] "f\nWeather fair all day. No enemy was seen during the day. A Coy sent out patrols during\ntthe day and night but no opposition memt. B Coy moved forward to MR 195771. Orders\nGroup held at 2000 hrs and orders were given to have all personnel ready to move to\nnew location by 1200 hrs on the 6 of Jan 1945. YMCA was to show a movie in the evenp\nling but the CO cancelled it. Two Polish deserters from the German army walked into\n|A Coy lines."                                                                                                                                                                                                                                                                                                                          
 [5] "iz\nWeather clear all day. CO, Coy Comds, Sig Officer and Vickers Officer left to recce\nnew location at 0830 hrs. Unit started to move to new location at 1200 hrs, Unit   Bs\narrived at AYE MR 2683, Sheet 91, 1\" to mile at 1500 hrs. Personnel were shown to\ntheir areas and billets."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 [6] "| 9\neather clear all day. Observation Post set up by the Intelligence Sec at MR 253813.| |\nQuiet all day. No enemy activity during the day."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         
 [7] "|\neather overcast and snowing. Intelligence Section set up another OP at MR 268814.\nNo enemy activity during the day. At 2300 hrs orders were received that all personnel\nere to be ready to move to new area on the morning of the 9th Jan, 1945."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
 [8] ":"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     
 [9] "‘\nWeather clear and cold, Bm started to move at 0830 hrs. Bn reached Champlon"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        
[10] "&\nFamenine, MR 3182 at 1230 hrs. Bn relieved the HLI. Coys immediately took up"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[11] ":\npositions for all around defence."                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  
[12] "4\n"                                                                                                                                                                                                                             

How do I run this on every image in that magick object?
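
One possible approach, as a sketch rather than a definitive answer: index the frames one at a time and OCR each separately. PNG stores a single image, so writing the whole multi-frame object to PNG likely passes only the first page to `ocr()`. The helper names `ocr_one_page` and `ocr_all_pages` are hypothetical, and the cleaning is trimmed to two of the steps from `multi_text_clean()`; the remaining steps would slot in the same way.

```r
library(magick)
library(tesseract)

# Hypothetical helper: clean one single-frame image and return its OCR text.
# Only two of the cleaning steps from the question are kept, for brevity.
ocr_one_page <- function(page, engine) {
  cleaned <- image_convert(image_normalize(page), type = "Grayscale")
  # With no path given, image_write() returns the encoded PNG as a raw vector
  png_raw <- image_write(cleaned, format = "png", density = "300x300")
  tesseract::ocr(png_raw, engine = engine)
}

# Hypothetical helper: run ocr_one_page() on every frame of a magick object.
# The engine is created once and reused for all pages.
ocr_all_pages <- function(img,
                          engine = tesseract(options = list(preserve_interword_spaces = 1))) {
  vapply(seq_along(img),
         function(i) ocr_one_page(img[i], engine),
         character(1))
}

# Usage with the object from the question:
# text_by_page <- ocr_all_pages(multi_images[[1]])
# text_multi   <- stringr::str_split(text_by_page, pattern = "\\s{5,}")
```

The result is a character vector with one element per page, which keeps the page boundaries intact before any further splitting.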

r imagemagick tesseract rmagick
1 Answer

You can do the following in ImageMagick.

Input:
