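I have a PDF image file saved on my computer (e.g. "p1.pdf", a scanned copy of the original document). The file looks like this (I added the red lines to show the distinction):
I want to import this PDF into R and convert it to a "table object". I tried to do this by following this tutorial (https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html):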
library(tesseract)
library(magick)
library(png)
library(pdftools)
library(tidyverse)

# Render the scanned PDF to a PNG at 600 dpi, then OCR the image
pngfile <- pdftools::pdf_convert('p1.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)

# Same OCR, this time through magick
raw_img <- image_read(pngfile)
raw_img %>%
  image_ocr()
This seems to work, so I continued following the tutorial:
# OCR engine restricted to digits, periods and spaces
num_only <- tesseract::tesseract(
  options = list(tessedit_char_whitelist = ".0123456789 ")
)
raw_img %>%
  image_quantize(colorspace = 'gray') %>%
  image_threshold() %>%
  image_crop(geometry_area(100, 0, 600, 40)) %>%
  ocr(engine = num_only)
# OCR engine restricted to letters, digits and a few punctuation characters
combo <- tesseract::tesseract(
  options = list(
    tessedit_char_whitelist = paste0(
      c(letters, LETTERS, " ", ".0123456789 (-)"), collapse = "")
  ))
raw_text <- raw_img %>%
  image_quantize(colorspace = "gray") %>%
  image_transparent("white", fuzz = 22) %>%
  image_background("white") %>%
  image_threshold() %>%
  image_crop(geometry_area(0, 0, 110, 45)) %>%
  ocr(engine = combo)
Now I am trying to convert this into a table (a "tibble"):
raw_tibble <- raw_text %>%
  str_split(pattern = "\n") %>%
  unlist() %>%
  tibble(data = .)
# A tibble: 68 x 1
data
<chr>
1 "- ALPHABETICAL LISTING ABT"
2 "a PlaceYear of Reg"
3 "Name Address Graduation Year"
4 ""
5 "- (John Smith) BC ABC Uni~
6 ""
7 "Email: [email protected] P"999-999-~
8 "BCC University 2002"
9 "- Jane Smith HGH Univer~
10 "Email [email protected] "
# ... with 58 more rows
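This is where the problem is: the tutorial's instructions for tidying the result further seem very specific to the tutorial's own example (i.e. football-related), so I can't apply them to my problem. Can someone show me how to convert the result I have into a table that looks closer to the original PDF?
Maybe someone could try saving the image I uploaded as a png file and see whether they can get this code to work?
Thanks!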
Note 1: "401 Fake St Some City, Some State A1C B23 P: 555-555-5555 501 Fake St Some City, Some State A1C B23 p:666-666-6666 601 Fake St Some City, Some State A1C B23 p:666-666-6666" can all appear as a single row. So the final table might look like this:
id Name Address
1 1 (John Smith) Email:[email protected] AB p:999-999-9999
2 2 Jane Smith Email:[email protected] p:111-111-1111
3 3 Henry Smith 201 Fake St,Some City, Some State A1C B23 P:111-222-3333
4 4 Jason Smith 301 Fake St Some City,Some State A1C B23 P:555-555-5555
5 5 Luke Smith [email protected] 401 Fake St Some City, Some State A1C B23 P: 555-555-5555 501 Fake St Some City, Some State A1C B23 p:666-666-6666 601 Fake St Some City, Some State A1C B23 p:666-666-6666
Place_Year_Graduation Reg_Year
1 ABC University 2001, BCC University 2002,DEF University 2003 2000
2 HGH University/2001, Some Other School 2002 2000
3 University ABC 1999
4 Univer 123 2005
5 ABC College 2010
final = structure(list(id = 1:5, Name = c(" (John Smith) Email:[email protected]",
"Jane Smith Email:[email protected]", "Henry Smith", "Jason Smith",
"Luke Smith [email protected]"), Address = c("AB p:999-999-9999",
"p:111-111-1111", "201 Fake St,Some City, Some State A1C B23 P:111-222-3333",
"301 Fake St Some City,Some State A1C B23 P:555-555-5555", "401 Fake St Some City, Some State A1C B23 P: 555-555-5555 501 Fake St Some City, Some State A1C B23 p:666-666-6666 601 Fake St Some City, Some State A1C B23 p:666-666-6666"
), Place_Year_Graduation = c("ABC University 2001, BCC University 2002,DEF University 2003",
"HGH University/2001, Some Other School 2002", "University ABC",
"Univer 123", "ABC College"), Reg_Year = c(2000, 2000, 1999,
2005, 2010)), class = "data.frame", row.names = c(NA, -5L))
Note 2: I know this is a hard problem and that R may not be the most suitable tool for it, so I am also open to solving it with Python.
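Does this need to be done in R or Python? Tabula (https://tabula.technology/) was written specifically to deal with this problem, and it does a very good job, especially on basic tables like yours.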
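Although this is not R-specific, in this case I would try to run tesseract as far as possible from the shell command line, via a PDF service (API) that regenerates the PDF and preserves the text layout. Personally, I only ever run a few files at a time, since each one needs some editing.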
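In this case the errors are small, e.g. [email protected] should be [email protected], but never trust the "numbers": it can be costly when the figures do not add up to a profit, or when the wrong medical dose is administered.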
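One way to act on that distrust is to flag every numeric field that does not fit the pattern you expect, and then check those by eye against the scan. The following is only a minimal sketch in Python (the asker said Python is acceptable). It assumes phone numbers are written as `P:` or `p:` followed by `ddd-ddd-dddd`, which is a guess based on this sample; the function name `suspicious_phones` and the input lines are illustrative, with values like those that show up in the OCR'd text for this file.

```python
import re

# Assumption: a phone field looks like "P:" or "p:" followed by the number.
PHONE_FIELD = re.compile(r"\b[Pp]:\s*(\S+)")
# Assumption: a valid number is ddd-ddd-dddd (inferred from the sample only).
VALID_PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")


def suspicious_phones(lines):
    """Return (line_number, value) for phone fields that are not ddd-ddd-dddd."""
    flagged = []
    for number, line in enumerate(lines, start=1):
        for match in PHONE_FIELD.finditer(line):
            value = match.group(1)
            if not VALID_PHONE.fullmatch(value):
                flagged.append((number, value))
    return flagged


# Illustrative lines; in practice these would be read from the OCR text file.
ocr_lines = [
    "P:999-999-9999",
    "P: 111-111-111 2000",
    "P:111-2223333",
    "P:555-555-5555",
]

print(suspicious_phones(ocr_lines))
# [(2, '111-111-111'), (3, '111-2223333')]
```

A check like this only catches values with the wrong shape. A digit misread as another digit still passes, so it narrows down the manual review but does not replace it.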
Once you have a text layout, it is processed in the same way as a searchable PDF, so in this case:
xpdf-tools-win-4.03\bin64>pdftotext -layout -table a.pdf&type a.txt
ALPHABETICAL LISTING ABT
Place/Year of Reg
Name Addresss Graduation Year
(John Smith) AB ABC University 2001
BCC University 2002 2000
Email: [email protected] P:999-999-9999
DEF University 2003
Jane Smith HGH University/2001
P: 111-111-111 2000
Email: [email protected] Some Other School 2002
201 FakeSt
Henry Smith Some City, Some State A1C B23 University ABC 1999
P:111-2223333
301 Fake St
Jason Smith Some City, Some State A1C B23 Univer 123 2005
P:555-555-5555
401 Fake St ABC College 2010
Luke Smith Some City, Some State A1C B23
[email protected] P: 666-666-6666
501 Fake St
Some City, Some State A1C B23
P:666-666-6666
601 Fake 5St
Some City, Some State A1C B23
P: 666-666-6666
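Importing that text into a spreadsheet, or into whatever else you prefer, should re-parse it into cells, leaving only minimal realignment of the table to do.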
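If you would rather script that re-parsing step than use a spreadsheet, a rough sketch in Python follows. It assumes the layout text keeps each column at a fixed character offset (which is what `-layout` aims for) and that the first non-blank line is a single header row. Neither holds exactly for this file, whose header spans two lines, so expect to adjust it. The helper names and the sample spacing are made up for illustration, and the sample is built with `ljust` only to simulate aligned output.

```python
import re


def column_starts(header):
    """Offsets where each header cell begins; cells are separated by 2+ spaces."""
    return [m.start() for m in re.finditer(r"\S+(?: \S+)*", header)]


def slice_line(line, starts):
    """Cut one layout line at the column offsets and strip each cell."""
    bounds = list(starts) + [None]
    return [line[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def parse_layout(text):
    """Treat the first non-blank line as the header and slice every line into cells."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    starts = column_starts(lines[0])
    return [slice_line(ln, starts) for ln in lines]


# Simulated layout text: the column widths (14, 32, 20) are assumptions.
sample_rows = [
    ("Name", "Address", "Graduation", "Year"),
    ("Jason Smith", "301 Fake St", "Univer 123", "2005"),
    ("", "Some City, Some State A1C B23", "", ""),
    ("", "P:555-555-5555", "", ""),
]
sample_text = "\n".join(
    (a.ljust(14) + b.ljust(32) + c.ljust(20) + d).rstrip()
    for a, b, c, d in sample_rows
)

table = parse_layout(sample_text)
for row in table:
    print(row)
```

Each physical line becomes one row of cells. Continuation lines, such as the second and third address lines, come out with an empty Name cell, so they still have to be merged into the record they belong to. That merging is the "minimal realignment" that remains.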