How to distinguish actual (visible) spaces from other kinds of spaces in an XML file converted from a PDF

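When converting a PDF file to an XML file, I sometimes find unwanted spaces (non-breaking spaces, zero-width spaces, and so on) in the middle of words or sentences in the XML output, yet these characters are not visible when I view the XML file. Given a PDF file, how can I make the converted XML reflect only the spaces that are actually visible in the PDF and ignore the other kinds of space characters?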

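import io

from pdfminer.converter import HTMLConverter, TextConverter, XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def convert(case, input_file_path, targetfilepath, pages=None):
    # An empty set tells PDFPage.get_pages to process every page; otherwise
    # `pages` is assumed to be an iterable of zero-based page numbers.
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    codec = 'utf-8'
    caching = True

    # A binary buffer works for all three converters and matches the 'wb' write below.
    output = io.BytesIO()
    if case == 'text':
        converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
    elif case == 'HTML':
        converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())
    elif case == 'XML':
        converter = XMLConverter(manager, output, codec=codec, laparams=LAParams())
    else:
        raise ValueError(f'unsupported case: {case!r}')

    interpreter = PDFPageInterpreter(manager, converter)
    with open(input_file_path, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # Close the converter before reading the buffer so that any trailing
    # markup (for example the closing </pages> tag in XML) is written.
    converter.close()
    convertedPDF = output.getvalue()
    output.close()

    with open(targetfilepath, 'wb') as convertedFile:
        convertedFile.write(convertedPDF)


convert('XML', input_file_path, directory_path_xml1, pages=None)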

This is how I convert the PDF to an XML file.

0 votes

If you are using pdfminer.six, the following code should create a valid XML output file:

from pdfminer.high_level import extract_text_to_fp
from io import BytesIO

with open('PDFSample.pdf', 'rb') as pdf_file:
    xml_output = BytesIO()
    extract_text_to_fp(pdf_file, xml_output, output_type='xml')
    xml_output.seek(0)
    xml_content = xml_output.read()

with open('output.xml', 'wb') as output_file:
    output_file.write(xml_content)

xml_output.close()
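
The conversion above does not by itself separate visible spaces from invisible ones, so a post-processing step may help. What follows is only a minimal sketch using the standard library: the names `ZERO_WIDTH`, `classify_space`, and `normalize_spaces` are hypothetical, and the set of zero-width characters is an assumption that may need extending for a particular PDF. It also assumes that non-breaking spaces and other Unicode space separators (category `Zs`) should be kept as ordinary spaces, since they do occupy width on the page; if they should be dropped instead, change that branch.

```python
import unicodedata

# Assumed list of characters that occupy no width; extend it if needed.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def classify_space(ch):
    """Return 'zero-width', 'visible', or None for a non-space character."""
    if ch in ZERO_WIDTH:
        return "zero-width"
    # Category Zs covers the regular space, NBSP (U+00A0), and similar
    # separators that take up horizontal room.
    if unicodedata.category(ch) == "Zs":
        return "visible"
    return None


def normalize_spaces(text):
    """Drop zero-width characters and turn every visible space into ' '."""
    out = []
    for ch in text:
        kind = classify_space(ch)
        if kind == "zero-width":
            continue
        out.append(" " if kind == "visible" else ch)
    return "".join(out)


print(normalize_spaces("foo\u200bbar\u00a0baz"))  # prints "foobar baz"
```

One way to use this would be to apply `normalize_spaces` to the text extracted from the XML (for example the content of each `<text>` element) before writing it back out.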
