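When converting a PDF file to an XML file, I sometimes find unwanted spaces (non-breaking spaces, zero-width spaces, etc.) in the middle of words or sentences in the XML file, even though these spaces are not visible when viewing the XML file. For a given PDF file, what should I do if I want only the actual visible spaces in the PDF to be reflected in the converted XML file, ignoring every other kind of space?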
    import io

    from pdfminer.converter import HTMLConverter, TextConverter, XMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage


    def convert(case, input_file_path, targetfilepath, pages=None):
        # An empty set tells PDFPage.get_pages to process every page.
        pagenums = set(pages) if pages else set()
        manager = PDFResourceManager()
        codec = 'utf-8'
        caching = True

        # Use a binary buffer in every case, because the result is written in 'wb' mode.
        output = io.BytesIO()
        if case == 'text':
            converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
        elif case == 'HTML':
            converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())
        elif case == 'XML':
            converter = XMLConverter(manager, output, codec=codec, laparams=LAParams())
        else:
            raise ValueError(f'unsupported output type: {case}')

        interpreter = PDFPageInterpreter(manager, converter)
        with open(input_file_path, 'rb') as infile:
            for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
                interpreter.process_page(page)

        # Close the converter before reading the buffer: the XML and HTML
        # converters write their closing markup in close().
        converter.close()
        convertedPDF = output.getvalue()
        output.close()

        with open(targetfilepath, 'wb') as convertedFile:
            convertedFile.write(convertedPDF)


    convert('XML', input_file_path, directory_path_xml1, pages=None)
This is how I convert the PDF to an XML file.
With pdfminer.six, the following code should create a valid XML output file:
    from io import BytesIO

    from pdfminer.high_level import extract_text_to_fp

    with open('PDFSample.pdf', 'rb') as pdf_file:
        xml_output = BytesIO()
        extract_text_to_fp(pdf_file, xml_output, output_type='xml')
        xml_output.seek(0)
        xml_content = xml_output.read()

    with open('output.xml', 'wb') as output_file:
        output_file.write(xml_content)

    xml_output.close()
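
If the invisible characters still show up in the output, one possible approach is to post-process the extracted text before saving it. The sketch below is only an illustration, not something pdfminer.six provides: `clean_invisible_spaces`, `ZERO_WIDTH` and `SPACE_LIKE` are hypothetical names, and the two character lists are assumptions that should be adjusted to whatever actually appears in your files. It also assumes the characters are present literally in the UTF-8 output rather than as numeric entities.

```python
# Hypothetical helper, not part of pdfminer.six.
# Zero-width characters: removed outright (assumed list, extend as needed).
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"
# Space-like characters that do take up width: by default replaced
# with a plain ASCII space (assumed list, extend as needed).
SPACE_LIKE = "\u00a0\u2009\u202f"


def clean_invisible_spaces(text, drop_space_like=False):
    """Remove zero-width characters and normalize space-like ones.

    With drop_space_like=True the space-like characters are deleted
    instead of being replaced by a regular space.
    """
    # In a str.translate table, mapping a code point to None deletes it.
    table = dict.fromkeys(map(ord, ZERO_WIDTH))
    replacement = None if drop_space_like else " "
    table.update(dict.fromkeys(map(ord, SPACE_LIKE), replacement))
    return text.translate(table)
```

With the snippet above, this could be applied as `xml_content = clean_invisible_spaces(xml_content.decode('utf-8')).encode('utf-8')` before writing `output.xml`. Note that deleting a character this way may leave an empty `<text>` element behind in the XML, so whether to delete or to replace with a regular space depends on how the file is consumed afterwards.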