I have been trying to do OCR in R (reading data from PDFs where the data is a scanned image). I have been reading about this at http://electricarchaeology.ca/2014/07/15/doing-ocr-within-r/, which is a very good post.
Per the linked post, the working code for the 3 steps (PDF to PPM, PPM to TIFF, TIFF to text) is:
lapply(myfiles, function(i){
# convert pdf to ppm (an image format), just pages 1-10 of the PDF
# but you can change that easily, just remove or edit the
# -f 1 -l 10 bit in the line below
shell(shQuote(paste0("F:/xpdf/bin64/pdftoppm.exe ", i, " -f 1 -l 10 -r 600 ocrbook")))
# convert ppm to tif ready for tesseract
shell(shQuote(paste0("F:/ImageMagick-6.9.1-Q16/convert.exe *.ppm ", i, ".tif")))
# convert tif to text file
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
# delete tif file
file.remove(paste0(i, ".tif" ))
})
When I run this, the first two steps work fine (though they take a lot of time for a 4-page PDF; I will look into scalability later, first I want to see whether this works at all).
When the third step runs, i.e.
shell(shQuote(paste0("F:/Tesseract-OCR/tesseract.exe ", i, ".tif ", i, " -l eng")))
I get this error:
Error: evaluation nested too deeply: infinite recursion / options(expressions=)?
or Tesseract crashes.
Any workaround or root-cause analysis would be appreciated.
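Not a definitive diagnosis, but one thing worth checking is how the command line is quoted: wrapping the entire command in shQuote() turns the program path and all its arguments into a single quoted token, which the shell then tries to run as one literal program name. A minimal sketch that instead quotes each path individually before pasting (the tool location and base name are just the ones from the question, only the string construction is shown):

```r
# Build the tesseract command quoting each path separately,
# instead of shQuote()-ing the whole line.
i <- "ocrbook"  # hypothetical base name of the image/output files
cmd <- paste(shQuote("F:/Tesseract-OCR/tesseract.exe"),
             shQuote(paste0(i, ".tif")),
             shQuote(i),
             "-l eng")
cat(cmd, "\n")
# then run it with: shell(cmd)
```

This keeps the flags (`-l eng`) unquoted while protecting the paths, which matters once file names contain spaces.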
Using the tesseract package, I created a sample script that works. It even works on scanned PDFs.
library(tesseract)
library(pdftools)
# Render the pdf to a tiff image (one file per page)
img_file <- pdftools::pdf_convert("F:/gowtham/A/B/invoice.pdf", format = 'tiff', dpi = 400)
# Extract text from the tiff image
text <- ocr(img_file)
write.table(text, "F:/gowtham/A/B/mydata.txt")
I'm new to R and programming, so please point out any mistakes. Hope this helps.
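One small note on the script above: write.table() wraps each line of the OCR text in quotes and adds row names, since it is meant for tabular data. If you just want the raw text, writeLines() may be a better fit. A quick sketch, using a stand-in vector in place of the real ocr() output and a temp file instead of the paths above:

```r
# Stand-in for the character vector returned by ocr()
text <- c("first line of OCR output", "second line")
out_file <- file.path(tempdir(), "mydata.txt")
# writeLines() writes the text verbatim, without quotes or row names
writeLines(text, out_file)
readLines(out_file)
```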
The newly released tesseract package may be worth a look. It allows you to perform the whole process inside R, without shell calls.
Adopting the process used in the tesseract package's help documentation, your function would look something like this:
lapply(myfiles, function(i){
# render the pdf page to a bitmap
bitmap <- pdftools::pdf_render_page(i, dpi = 300)
# write the bitmap out as a tiff, ready for tesseract
tiff::writeTIFF(bitmap, paste0(i, ".tiff"))
# perform OCR on the .tiff file
out <- ocr(paste0(i, ".tiff"))
# delete tiff file
file.remove(paste0(i, ".tiff"))
out
})
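Note that pdf_render_page() renders a single page (page 1 by default). If your PDFs have multiple pages, one way to extend the function above is to get the page count from pdf_info() and loop over pages. A sketch under those assumptions (the helper name is mine, and this has not been run against real files):

```r
library(pdftools)
library(tesseract)

# Hypothetical helper: OCR every page of one PDF and return the combined text
ocr_whole_pdf <- function(i) {
  n_pages <- pdf_info(i)$pages              # number of pages in the PDF
  texts <- vapply(seq_len(n_pages), function(p) {
    bitmap <- pdf_render_page(i, page = p, dpi = 300)
    tif <- paste0(i, "_page", p, ".tiff")
    tiff::writeTIFF(bitmap, tif)            # write page bitmap as tiff
    txt <- ocr(tif)                         # OCR the page image
    file.remove(tif)                        # clean up the temporary tiff
    txt
  }, character(1))
  paste(texts, collapse = "\n")
}
```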
Here is another approach you could consider:
library(reticulate)
# create the conda environment with the required packages if it does not exist yet
conda_env <- conda_list()
if (!any(conda_env$name == "ocrTable")) {
reticulate::conda_create(envname = "ocrTable", python_version = "3.7.16")
reticulate::conda_install(envname = "ocrTable", packages = "transformers", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "torch", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "requests", pip = TRUE)
reticulate::conda_install(envname = "ocrTable", packages = "Pillow", pip = TRUE)
}
reticulate::use_condaenv("ocrTable")
# load the TrOCR processor and handwritten-text model from transformers
transformers <- import("transformers")
TrOCRProcessor <- transformers$TrOCRProcessor
VisionEncoderDecoderModel <- transformers$VisionEncoderDecoderModel
processor <- TrOCRProcessor$from_pretrained("microsoft/trocr-base-handwritten")
model <- VisionEncoderDecoderModel$from_pretrained("microsoft/trocr-base-handwritten")
# fetch a sample handwritten-text image and run it through the model
requests <- import("requests")
PIL <- import("PIL")
Image <- PIL$Image
url <- "https://fki.tic.heia-fr.ch/static/img/a01-122-02.jpg"
image <- Image$open(requests$get(url, stream = TRUE)$raw)$convert("RGB")
pixel_values <- processor(image, return_tensors = "pt")$pixel_values
generated_ids <- model$generate(pixel_values)
generated_text <- processor$batch_decode(generated_ids, skip_special_tokens = TRUE)
generated_text
[1] "industry, \" Mr. Brown commented icily. \" Let us have a"