Converting an edited PDF to TXT

Question

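Good day, community,

I am trying to put together some code to convert a PDF to text, but the results are not what I expected. I have tried different libraries, such as pytesseract, pdfminer, pdftotext, pdf2image, and OpenCV, but all of them extract the text incompletely or with errors. The last two pieces of code I used are: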

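# Assumption: extract_text comes from pdfminer.six (pip install pdfminer.six);
# the original snippet did not show its import.
from pdfminer.high_level import extract_text

def convert_pdf_to_txt(path):
    # Extract the text of every page in the PDF as a single string
    text = extract_text(path)
    return text

# Change the file path according to the location of your PDF file
pdf_path = '/content/drive/MyDrive/PDF/file.pdf'

# Convert the PDF to text
text = convert_pdf_to_txt(pdf_path)

# Write the text to a file (UTF-8 so non-ASCII characters are preserved)
with open('extracted_text.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Print a confirmation message
print('The text has been saved to the "extracted_text.txt" file.')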

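However, when I use an online PDF-to-text converter, the result is very good, almost perfect, without the errors I get from my two pieces of code. I have attached the PDF I want to convert to text, along with the results I got from both pieces of code when I tried to convert my file.

These are the attached files: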

https://anonfiles.com/P09bnen5z6/file_pdf https://anonfiles.com/g7Aan6n5ze/Archive_txt

Answer 1

Use PdfReader from PyPDF2:

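# Notebook (Colab/Jupyter) shell command; run "pip install PyPDF2" in a terminal instead
!pip install PyPDF2
from PyPDF2 import PdfReader

reader = PdfReader('/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/PDF/some_word_doc_as_pdf.pdf')
# Note: this reads only the first page of the document
page = reader.pages[0]
extracted_text = page.extract_text()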

and savetxt from numpy:

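import numpy as np

# The output is plain text, so it is given a .txt extension here
# (the original snippet used a .pdf extension, which would not produce a valid PDF)
file_name = '/content/drive/My Drive/Colab Notebooks/DATA_FOLDERS/PDF/new_word_doc_as_pdf.txt'
np.savetxt(file_name, [extracted_text], fmt='%s')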
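
The answer's code takes reader.pages[0], so it extracts only the first page. A minimal sketch of a whole-document variant follows, assuming PyPDF2 2.x or later is installed. The helper names join_pages and pdf_to_txt are hypothetical (they are not part of PyPDF2), and writing with pathlib instead of numpy.savetxt is a substitution made for this sketch, not something the answer proposed.

```python
from pathlib import Path


def join_pages(page_texts):
    """Join per-page strings, skipping pages where extraction returned nothing."""
    cleaned = [t.strip() for t in page_texts if t and t.strip()]
    return "\n\n".join(cleaned)


def pdf_to_txt(pdf_path, txt_path):
    """Extract text from every page of pdf_path and write it to txt_path."""
    # Imported lazily so join_pages can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    # reader.pages is a sequence of page objects; extract_text() returns a str
    text = join_pages(page.extract_text() for page in reader.pages)
    Path(txt_path).write_text(text, encoding="utf-8")
    return text
```

It could be called as pdf_to_txt('file.pdf', 'extracted_text.txt'), where both paths are placeholders. This only helps when the PDF contains a real text layer; it does not perform OCR on scanned pages.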
