I have been trying to do OCR in R (reading data from PDFs where the data is a scanned image). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/, which is a very good post.
Per the linked post, the working code for the 3 steps (PDF to PPM, PPM to TIFF, TIFF to text) is:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
When I run this, the first two steps work fine (though they take a lot of time for a 4-page PDF; I will look into scalability later, first I want to see whether this works at all).
When the third step runs, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
or Tesseract crashes.
Any workaround or root-cause analysis would be appreciated.
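Not a definitive diagnosis, but one thing worth checking is how the command line is quoted: wrapping the entire command in shQuote() turns the program path and all its arguments into a single quoted token, which the shell then tries to run as one literal program name. A minimal sketch that instead quotes each path individually before pasting (the tool location and base name are just the ones from the question, only the string construction is shown):

```r
# Build the tesseract command quoting each path separately,
# instead of shQuote()-ing the whole line.
i <- "ocrbook"  # hypothetical base name of the image/output files
cmd <- paste(shQuote("F:/Tesseract-OCR/tesseract.exe"),
             shQuote(paste0(i, ".tif")),
             shQuote(i),
             "-l eng")
cat(cmd, "\n")
# then run it with: shell(cmd)
```

This keeps the flags (`-l eng`) unquoted while protecting the paths, which matters once file names contain spaces.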
Using the tesseract package, I created a sample script that works. It even works on scanned PDFs.
library(tesseract)
library(pdftools)
# Render the pdf to a tiff image (one file per page)
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from the tiff image
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")
I'm new to R and programming, so please point out any mistakes. Hope this helps.
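One small note on the script above: write.table() wraps each line of the OCR text in quotes and adds row names, since it is meant for tabular data. If you just want the raw text, writeLines() may be a better fit. A quick sketch, using a stand-in vector in place of the real ocr() output and a temp file instead of the paths above:

```r
# Stand-in for the character vector returned by ocr()
text <- c("first line of OCR output", "second line")
out_file <- file.path(tempdir(), "mydata.txt")
# writeLines() writes the text verbatim, without quotes or row names
writeLines(text, out_file)
readLines(out_file)
```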
The newly released tesseract package may be worth a look. It allows you to perform the whole process inside R, without shell calls.
Adopting the process used in the tesseract package's help documentation, your function would look something like this:
lapply(myfiles, function(i){
# render the pdf page to a bitmap
bitmap <- pdftools::pdf_render_page(i, dpi = 300)
# write the bitmap out as a tiff, ready for tesseract
tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
# perform OCR on the .tiff file
out <- ocr(paste0(i, ".tiff"))
# delete tiff file
file.remove(paste0(i, ".tiff"))
out
})
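Note that pdf_render_page() renders a single page (page 1 by default). If your PDFs have multiple pages, one way to extend the function above is to get the page count from pdf_info() and loop over pages. A sketch under those assumptions (the helper name is mine, and this has not been run against real files):

```r
library(pdftools)
library(tesseract)

# Hypothetical helper: OCR every page of one PDF and return the combined text
ocr_whole_pdf <- function(i) {
  n_pages <- pdf_info(i)$pages              # number of pages in the PDF
  texts <- vapply(seq_len(n_pages), function(p) {
    bitmap <- pdf_render_page(i, page = p, dpi = 300)
    tif <- paste0(i, "_page", p, ".tiff")
    tiff::writeTIFF(bitmap, tif)            # write page bitmap as tiff
    txt <- ocr(tif)                         # OCR the page image
    file.remove(tif)                        # clean up the temporary tiff
    txt
  }, character(1))
  paste(texts, collapse = "\n")
}
```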
Here is another approach you could consider:
library(reticulate)
# create the conda environment with the required packages if it does not exist yet
conda_env <- conda_list()
if (!any(conda_env$name == "ocrTable")) {
reticulate::conda_create(envname = "ocrTable", python_version = "3.7.16")
reticulate::conda_install(envname = "ocrTable", packages = "transformers", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "torch", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "requests", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "Pillow", pip = TRUE)
}
reticulate::use_condaenv("ocrTable")
# load the TrOCR processor and handwritten-text model from transformers
transformers <- import("transformers")
TrOCRProcessor <- transformers$TrOCRProcessor
VisionEncoderDecoderModel <- transformers$VisionEncoderDecoderModel
processor <- TrOCRProcessor$from_pretrained("microsoft/trocr-base-handwritten")
model <- VisionEncoderDecoderModel$from_pretrained("microsoft/trocr-base-handwritten")
# fetch a sample handwritten-text image and run it through the model
requests <- import("requests")
PIL <- import("PIL")
Image <- PIL$Image
url <- "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image <- Image$open(requests$get(url, stream = TRUE)$raw)$convert("RGB")
pixel_values <- processor(image, return_tensors = "pt")$pixel_values
generated_ids <- model$generate(pixel_values)
generated_text <- processor$batch_decode(generated_ids, skip_special_tokens = TRUE)
generated_text
[1] "industry, \" Mr. Brown commented icily. \" Let us have a"