VBA - Get text from a scanned PDF and save it in Excel

Question

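I have a very specific problem. I have code that extracts text from a PDF file and saves it in Excel. The problem is that it does not work with scanned PDF files, because the text cannot be read.

My code does the following:

Open the PDF
Get the page and highlight the text on it
Save it as a Variant
Loop through that Variant and write out each word (this is the basic code; for now it only records one specific value)
Close the PDF

I would like it to work with scanned PDFs as well. I think the problem is that it cannot highlight the text, because the file is more like a picture saved as a PDF than a PDF containing real text. Here is my code (I did not write it myself; I found it on the internet):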

Public Function Get_VIN_From_CoC(PDF_File As String, OnWhichPage As Integer) As String

'This function reads the text of one PDF page as follows:
'1. Open the PDF file
'2. Acquire the requested page
'3. Collect the page text and search it for the wanted value

Dim AC_PD As Acrobat.AcroPDDoc 'access pdf file
Dim AC_Hi As Acrobat.AcroHiliteList 'set selection word count
Dim AC_PG As Acrobat.AcroPDPage 'get the particular page
Dim AC_PGTxt As Acrobat.AcroPDTextSelect 'get the text of selection area
Dim Ct_Page As Long 'count pages in pdf file
Dim j As Long, K As Long 'looping variables
Dim T_Str As String
Dim Hld_Txt As Variant 'get PDF total text into array
Dim VIN As String

Set AC_PD = New Acrobat.AcroPDDoc
Set AC_Hi = New Acrobat.AcroHiliteList

'set maximum selection area of PDF page
AC_Hi.Add 0, 32767

With AC_PD
    'open PDF file
    .Open PDF_File
    'get the number of pages of PDF file
    Ct_Page = .GetNumPages
    'if the page count cannot be determined, leave the function
    If Ct_Page = -1 Then
        MsgBox "Cannot determine the number of pages in PDF file '" & PDF_File & "'"
        .Close
        GoTo h_end
    End If

    T_Str = ""
    'get the page
    Set AC_PG = .AcquirePage(OnWhichPage)
    
    'get the full page selection
    Set AC_PGTxt = AC_PG.CreateWordHilite(AC_Hi)
    
    'if text selected successfully get the all the text into T_Str string
    If Not AC_PGTxt Is Nothing Then
        With AC_PGTxt
            For j = 0 To .GetNumText - 1
                T_Str = T_Str & .GetText(j)
            Next j
        End With
    End If


    'if text was read successfully, split T_Str on vbCrLf into the array Hld_Txt
    'and loop through the lines looking for the wanted value
    If T_Str <> "" Then
        Hld_Txt = Split(T_Str, vbCrLf)
        For K = 0 To UBound(Hld_Txt)
            T_Str = CStr(Hld_Txt(K))
            If Left(T_Str, 1) = "=" Then T_Str = "'" & T_Str
            MsgBox T_Str
            'take the next line as the value, but only if a next line exists
            If Right(T_Str, 6) = "(Kg) :" And K < UBound(Hld_Txt) Then VIN = CStr(Hld_Txt(K + 1))
                
        Next K
    Else
        'inform the user if no text was retrieved from the PDF page
        MsgBox "No text found in page " & OnWhichPage
    End If
    
.Close
End With

h_end:
Set AC_PGTxt = Nothing
Set AC_PG = Nothing
Set AC_Hi = Nothing
Set AC_PD = Nothing

Get_VIN_From_CoC = VIN

End Function

Can you help me solve this?

Answer 1

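As @Panka Balint mentioned, you need to run OCR first. I don't think VBA is the ideal language for this task. A more promising approach is to use Python with the following code: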

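from PIL import Image
import pytesseract
from pdf2image import convert_from_path

# Path to the scanned PDF file
pdf_path = 'path/to/your/scanned_file.pdf'

# Convert the scanned PDF to images (one image per page)
images = convert_from_path(pdf_path)

# Iterate over the images and extract text with Tesseract
for i, image in enumerate(images):
    text = pytesseract.image_to_string(image, lang='eng')  # 'eng' stands for English
    print(f'Text from image {i+1}:\n{text}\n')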
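
Once the OCR text of a page is available as a string, the lookup that the VBA in the question performs (take the line that follows a line ending in "(Kg) :") could be sketched in Python roughly as below. This is only an illustration: the function name value_after_label and the sample text are hypothetical, and real OCR output may split or misread the label, so the matching would likely need to be more tolerant.

```python
from typing import Optional


def value_after_label(text: str, label_suffix: str = "(Kg) :") -> Optional[str]:
    """Return the line following the first line that ends with label_suffix.

    Hypothetical helper; returns None when no such line (or no next line) exists.
    """
    # Strip each line so trailing spaces from OCR do not defeat the suffix match.
    lines = [line.strip() for line in text.splitlines()]
    # Skip the last line: it has no following line to return.
    for index, line in enumerate(lines[:-1]):
        if line.endswith(label_suffix):
            return lines[index + 1]
    return None


# Made-up sample text standing in for one page of OCR output.
sample = "Mass (Kg) :\nABC123\nOther line"
print(value_after_label(sample))
```

The returned string could then be written to a worksheet cell by whatever means is already in use, for example by returning it to the calling VBA code.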
