Extracting text from a PDF with Python - highlighting

Question

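I'm trying to write a program that extracts text from a PDF, searches for keywords and highlights them, then writes a new PDF with the keywords highlighted. I don't know whether I need to extract the text and then write a new document, or whether I can highlight the words without extracting them. I need to keep the text formatting. I tried reportlab, but it extracted the text and lost the formatting. I'm new to programming, so the solution may be easy, but I don't have the skills yet.

I'm an electrical engineer and need to read a lot of technical specifications, such as IEC or NBR (the Brazilian version of IEC), so this code would help me a lot.

This is the code I have written so far: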

import PyPDF2

# Path to the PDF file
pdf_file = r"C:\Users\pietro\Desktop\Projects\espectest.pdf"

words = ["Teste"]

# Create a PdfReader for the PDF file and a PdfWriter for the output
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_writer = PyPDF2.PdfWriter()

# Get the number of pages in the PDF
num_pages = len(pdf_reader.pages)

# Create an empty list
pages = []

# Get the text of each page and copy the page to the writer
for i in range(num_pages):
    page = pdf_reader.pages[i]
    texto = page.extract_text()
    pages.append(page)
    pdf_writer.add_page(page)
    # ---------------------------------------------------------------------------------------
    # here I need to discover how to highlight words and write them to the new file
    # ---------------------------------------------------------------------------------------

# Write the new PDF
pdf_writer.write("especteste123.pdf")

I've tried PyPDF2, reportlab, Fitz, PDFPlumber

Answer 1

Use PyMuPDF.

import fitz  # PyMuPDF

my_keywords = ["kw1", "kw2", "kw3"]
doc = fitz.open("input.pdf")  # the PDF
for page in doc:  # iterate over the pages
    for kw in my_keywords:  # iterate over the keywords
        rectlist = page.search_for(kw)  # locate keyword on page
        for rect in rectlist:  # iterate over its occurrences
            page.add_highlight_annot(rect)  # highlight it

doc.save("output.pdf")

Answer 2

Install PyMuPDF:

python -m pip install --upgrade pymupdf

Here is the source code:

import fitz  # PyMuPDF

def extract_highlighted_text(pdf_path):
    highlighted_text = []

    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        print('Page -> ', page)
        print('Text -> ', page.get_text())
        
        # Get all the annotations on the page
        annotations = page.annots()

        for annot in annotations:
            print(annot)
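            # Check if the annotation is a highlight
            if annot.type[0] == 8:  # 8 corresponds to a highlight annotation in PyMuPDF
                # annot.info["subject"] is annotation metadata, not the marked text,
                # so read the text inside the annotation's rectangle instead
                highlight_text = page.get_textbox(annot.rect)
                highlighted_text.append(highlight_text)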

    # Close the PDF document
    pdf_document.close()

    return highlighted_text

# Usage example
pdf_path = 'INPUT_FILE.pdf'
highlighted_text = extract_highlighted_text(pdf_path)

for text in highlighted_text:
    print(text)

Learn more:

1. More information about extracting text from documents

2. Marking extracted text

3. Annotations
