I need automated code to extract tables from PDF files in R.
I searched the web and found the tabulizer package.
I used:
extract_tables(f2, pages = 25, guess = TRUE, encoding = "UTF-8", method = "stream")  # f2 is the PDF file name
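I tried every method type, but the results are not tidy.
Some columns are merged and there are many blank cells, as you can see in the image file.
I could fix the data by hand, but the goal is to automate this, so I need a general approach. The PDF files are not laid out consistently either: some tables are very clean, with every related row lined up perfectly, but others are not. As you can see in my result image, in column 4 several numbers are merged into the same column, while in the other columns the numbers match up one to one. What I want is for the columns to come out automatically as tidy as the table in the PDF.
Is there a package or a method that can tidy the extracted tables?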
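With the following code I was able to extract the numbers from the table. First, I converted the image to a PDF file. Then I converted the PDF file to a Word file. Finally, I extracted the table from the Word file. This solution works only on Windows.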
library(RDCOMClient)
library(magick)

# Source image, intermediate PDF, and resulting Word document
path_PDF <- "D:\\image_Stackoverflow79.pdf"
path_PNG <- "D:\\Dropbox\\Reponses_Stackoverflow\\image_Stackoverflow79.png"
path_Word <- "D:\\image_Stackoverflow79.docx"

# Crop the image to the table area and draw it into a PDF
pdf(path_PDF, height = 8, width = 6)
im <- image_read(path_PNG)
im <- image_crop(im, geometry = geometry_area(width = 510, height = 310, x_off = 100, y_off = 110))
plot(im)
dev.off()

# Open the PDF in Word (Word converts it on opening) and save it as .docx
wordApp <- COMCreate("Word.Application")
wordApp[["Visible"]] <- TRUE
wordApp[["DisplayAlerts"]] <- FALSE
doc <- wordApp[["Documents"]]$Open(normalizePath(path_PDF),
                                   ConfirmConversions = FALSE)
doc$SaveAs2(path_Word)

# Size of the first table in the converted document
nb_Row <- doc$tables(1)$Rows()$Count()
nb_Col <- doc$tables(1)$Columns()$Count()
mat_Temp <- matrix(NA, nrow = nb_Row, ncol = nb_Col)

# Read the table cell by cell; cells that cannot be read stay NA
for (i in 1:nb_Row) {
  for (j in 1:nb_Col) {
    mat_Temp[i, j] <- tryCatch(doc$tables(1)$cell(i, j)$range()$text(),
                               error = function(e) NA)
  }
}

mat_Temp
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "\r\a" "\r\a" "\r\a" "\r\a" "\r\a" "\r\a" "\r\a" "\r\a"
[2,] "\r\a" "0.46\r\a" "0.46\r\a" "0.46\r\a" "0.46\r\a" "0.46\r\a" "0.46\r\a" "\r\a"
[3,] "\r\a" "1.00\r\a" "0.00\r\a" "0.98\r\a" "0.03\r\a" "0.95\r\a" "0.85\r\a" NA
[4,] "\r\a" "0.025\r\a" "0.025\r\a" "0.025\r\a" "0.025\r\a" "0.025\r\a" "0.025\r\a" NA
[5,] "\r\a" "0.005\r\a" "0.005\r\a" "0.005\r\a" "0.005\r\a" "0.005\r\a" "0.005\r\a" NA
[6,] "\r\a" "1.49\r\a" "0.49\r\a" "1.47\r\a" "0.52\r\a" "1.44\r\a" "1.34\r\a" "\r\a"
[7,] "\r\a" "0.002\r\a" "0.002\r\a" "0.002\r\a" "0.002\r\a" "0.002\r\a" "0.002\r\a" "\r\a"
[8,] "\r\a" "1.492\r\a" "0.492\r\a" "1472\r\a" "0.522\r\a" "1.442\r\a" "1.342\r\a" "\r\a"
[9,] "\r\a" "1.59\r\a" "\r\a" "1.22\r\a" "\r\a" "\r\a" "\r\a" "\r\a"
[10,] "\r\a" "1.493\r\a" "0.493\r\a" "1473\r\a" "0.523\r\a" "1.443\r\a" "1.343\r\a" "\r\a"
[11,] "\r\a" "0.107\r\a" "o. 108\r\a" "o. 105\r\a" "0.108\r\a" "0.106\r\a" "0.104\r\a" "\r\a"
[12,] "\r\a" "\r\a" "\r\a" NA NA NA NA NA
With this approach, the numbers appear to end up in the correct columns.
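The matrix printed above still carries the "\r\a" marker at the end of every cell, and a few cells show recognition errors such as "o. 108". A minimal sketch of a cleanup step follows. The helper name clean_word_table is hypothetical (it is not part of any package), and treating a leading letter "o" before the decimal point as a zero is an assumption based only on the output shown here.

```r
# Hypothetical helper, not from the original answer: strips Word's cell-end
# markers and converts the remaining text to numbers.
clean_word_table <- function(mat) {
  # Remove the carriage return and bell characters that end each cell
  txt <- gsub("[\r\a]", "", mat)
  # Assumption: a leading letter "o" followed by "." is a misread "0."
  txt <- gsub("^[oO]\\.\\s*", "0.", txt)
  # Drop any whitespace left inside the cell
  txt <- gsub("\\s+", "", txt)
  # Empty cells become NA
  txt[!is.na(txt) & txt == ""] <- NA
  # Anything that still is not a number becomes NA without a warning
  num <- suppressWarnings(as.numeric(txt))
  # as.numeric drops the dimensions, so rebuild the matrix
  matrix(num, nrow = nrow(mat), ncol = ncol(mat))
}
```

It could be applied as mat_Num <- clean_word_table(mat_Temp). Values where the decimal point was lost during recognition, such as "1472" in column 4, are converted as they stand and would still need a separate check.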