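I'm trying to write a program that extracts text from a PDF, searches it for keywords, and writes a new PDF with those keywords highlighted. I don't know whether I need to extract the text and then write a new document, or whether I can highlight the words in place without extracting them. I need to preserve the text formatting. I tried reportlab, but it extracted the text and lost the formatting. I'm new to programming, so the solution may be simple, but I don't have the skills yet.
I'm an electrical engineer and have to read a lot of technical specifications, such as IEC or NBR (the Brazilian version of IEC), so this code would help me a lot.
Here is the code I've written so far: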
import PyPDF2

# Path to the input PDF
pdf_file = r"C:\Users\pietro\Desktop\Projects\espectest.pdf"
words = ["Teste"]

# Create a PdfReader for the input file and a PdfWriter for the output
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_writer = PyPDF2.PdfWriter()

# Get the number of pages in the PDF
num_pages = len(pdf_reader.pages)

# Start with an empty list of pages
pages = []

# Get the text of each page and copy the page to the writer
for i in range(num_pages):
    page = pdf_reader.pages[i]
    texto = page.extract_text()
    pages.append(page)
    pdf_writer.add_page(page)
    # ---------------------------------------------------------------------------------------
    # here I need to figure out how to highlight the words and write them to the new file
    # ---------------------------------------------------------------------------------------

# Write the copied pages to the new PDF
pdf_writer.write("especteste123.pdf")
I've tried PyPDF2, reportlab, Fitz, PDFPlumber
Use PyMuPDF.
import fitz  # PyMuPDF

my_keywords = ["kw1", "kw2", "kw3"]
doc = fitz.open("input.pdf")  # the PDF
for page in doc:  # iterate over the pages
    for kw in my_keywords:  # iterate over the keywords
        rectlist = page.search_for(kw)  # locate keyword on page
        for rect in rectlist:  # iterate over its occurrences
            page.add_highlight_annot(rect)  # highlight it
doc.save("output.pdf")
Install PyMuPDF:
python -m pip install --upgrade pymupdf
Here is the source code:
import fitz  # PyMuPDF


def extract_highlighted_text(pdf_path):
    highlighted_text = []
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)
    for page_num in range(pdf_document.page_count):
        page = pdf_document[page_num]
        print('Page -> ', page)
        print('Text -> ', page.get_text())
        # Get all the annotations on the page
        annotations = page.annots()
        for annot in annotations:
            print(annot)
            # Check if the annotation is a highlight
            if annot.type[0] == 8:  # 8 corresponds to a highlight annotation in PyMuPDF
                # Read the page text inside the annotation's rectangle;
                # annot.info["subject"] is only metadata and is usually empty
                highlight_text = page.get_textbox(annot.rect)
                highlighted_text.append(highlight_text)
    # Close the PDF document
    pdf_document.close()
    return highlighted_text


# Usage example
pdf_path = 'INPUT_FILE.pdf'
highlighted_text = extract_highlighted_text(pdf_path)
for text in highlighted_text:
    print(text)
Learn more: