Cannot find a regular expression match in a PDF using PDFMiner (Python)

Question (votes: 0, answers: 1)

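I am trying to find occurrences of a regular expression in a short PDF, but it does not work. I don't understand why, because searching for a plain string works without any problem, and the text is extracted correctly. Here is my code: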

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO
import re

def convert_pdf_to_txt(path):
    #\[\s*prima(?!\S)
    regex = re.compile(r"\[(\s)prima(?!\S)")

    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):

        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    reg = re.compile(r"\[(\s)prima(?!\S)")
    matches = re.findall(reg, text)
    return matches


print(convert_pdf_to_txt("fel_split.pdf"))

This is my regular expression: r"\[(\s)prima(?!\S)". I want to find "[prima".
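As a minimal sketch, independent of PDFMiner, the pattern can be tried on plain strings first. The sample strings below are invented for illustration and are not taken from the actual PDF. Two things are worth checking: `(\s)` requires exactly one whitespace character after the bracket, and because the pattern contains a capturing group, `re.findall` returns the captured whitespace rather than the whole match.

```python
import re

# The pattern exactly as written in the question.
pattern = re.compile(r"\[(\s)prima(?!\S)")

# Hypothetical sample strings (assumptions, not the real PDF text).
no_space = "text [prima more"
one_space = "text [ prima more"


def whole_matches(pat, s):
    """Return the full matched substrings, ignoring capturing groups."""
    return [m.group(0) for m in pat.finditer(s)]


print(pattern.findall(no_space))          # [] : no whitespace after "["
print(pattern.findall(one_space))         # [' '] : only the captured group
print(whole_matches(pattern, one_space))  # ['[ prima']
```

If the PDF text contains "[prima" with no space after the bracket, this pattern cannot match it, which would be consistent with the empty result described above.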

python regex pdf findall pdfminer
1 answer
-1
votes

Maybe there are multiple whitespace characters? Try this: r"\]\s+prima\s+"
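A possible variation on this suggestion, sketched under the assumption that the target is an opening bracket, optional whitespace, and then the word prima. Note that `\]` matches a closing bracket and `\s+` requires at least one whitespace character, so the sketch uses `\[` as in the question and `\s*` to accept zero or more. It has no capturing group, so `re.findall` returns whole matches. The name `find_prima` and the sample text are made up for illustration.

```python
import re

# \[      literal opening bracket
# \s*     zero or more whitespace characters (spaces, tabs, newlines)
# (?!\S)  the next character must not be non-whitespace,
#         i.e. it is whitespace or the end of the text
PRIMA = re.compile(r"\[\s*prima(?!\S)")

# The pattern from the suggestion above, kept for comparison.
SUGGESTED = re.compile(r"\]\s+prima\s+")


def find_prima(text):
    """Return every whole match of PRIMA in text (hypothetical helper)."""
    return PRIMA.findall(text)


# Invented sample text, not taken from the real PDF.
sample = "a [prima b [  prima\nc [\nprima d [primary ]  prima "

print(find_prima(sample))         # ['[prima', '[  prima', '[\nprima']
print(SUGGESTED.findall(sample))  # [']  prima ']
```

Extracted PDF text often has line breaks in unexpected places, which is one reason `\s*` may be more forgiving here than a single `\s`.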
