How to convert a PDF file to styled HTML using Apache Tika

Question (votes: 0, answers: 1)

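I am using Apache Tika to convert a PDF file to HTML. I need the extracted HTML to carry styling such as bold, italic, top, left, height, width, and the font family of each element, but so far I only get raw HTML tags with no styles. This is the code I am using: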

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ExpandedTitleContentHandler;
import org.xml.sax.SAXException;
import com.google.common.io.Files;

public class Test4 {

    public static void main(String[] args) throws IOException, TransformerConfigurationException, SAXException, TikaException {
        byte[] file = Files.toByteArray(new File("src/main/java/test/test.pdf"));
        AutoDetectParser tikaParser = new AutoDetectParser();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));
        ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
        tikaParser.parse(new ByteArrayInputStream(file), handler1, new Metadata());
        System.out.println(new String(out.toByteArray(), "UTF-8"));

    }

}

The output contains only plain HTML, as shown below:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2017-01-10T12:56:54Z"/>
<meta name="pdf:PDFVersion" content="1.5"/>
<meta name="pdf:docinfo:title" content="GROUP HEALTH INSURANCE"/>
<meta name="pdf:docinfo:created" content="2016-09-01T05:45:44Z"/>

<!-- some meta's here  -->
<title>GROUP HEALTH INSURANCE</title>
</head>
<body>
<div class="page">
<p/>
<p> Group Personal Accident Policy for Optional Travel Insurance for E-ticket </p>
<p>Sample Registration Number:102 

</p>
<p>Annexure 1 

</p>
</div>
</body>
</html>

How can I extract HTML content from a PDF together with its styles? Please post your answers. Thanks in advance.

java html pdf pdfbox apache-tika
1 answer
0 votes

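You can use the Pdf2Dom library. Unfortunately, the repository is no longer maintained or updated, but the latest available version works well. It is used together with PDFBox.

You can convert a PDF to a DOM with a snippet like the following: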


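    import java.io.InputStream;
    import java.io.StringWriter;
    import java.io.Writer;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.fit.pdfdom.PDFDomTree;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Converts pages start..end of the PDF to HTML with Pdf2Dom, then parses the HTML with jsoup.
    public static Document parseWithPdfDomTree(InputStream is, int start, int end) throws Exception
    {
        // PDDocument.load is the PDFBox 2.x API; try-with-resources closes the document even on failure.
        try (PDDocument pdf = PDDocument.load(is)) {
            PDFDomTree parser = new PDFDomTree();
            parser.setStartPage(start);
            parser.setEndPage(end);

            Writer output = new StringWriter();
            parser.writeText(pdf, output);

            return Jsoup.parse(output.toString());
        }
    }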
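
Once the HTML is available, the properties the question asks about (top, left, width, height, font family, bold, italic) still have to be read out of it. The following is a minimal sketch, assuming the converter writes them into inline style attributes of the form "name:value; name:value". The class InlineStyle and the sample style string are hypothetical and are not part of Tika, PDFBox, Pdf2Dom, or jsoup; the real attribute values depend on the converter and on the PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Tika, PDFBox, Pdf2Dom, or jsoup).
final class InlineStyle {

    // Splits a simple inline style such as "top:10pt; left:20pt" into
    // name/value pairs, keeping the order in which they appear.
    // Values that themselves contain ';' (for example inside url(...)) are not handled.
    static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        if (style == null) {
            return props;
        }
        for (String decl : style.split(";")) {
            int colon = decl.indexOf(':');
            if (colon <= 0) {
                continue; // skip empty or malformed declarations
            }
            String name = decl.substring(0, colon).trim().toLowerCase();
            String value = decl.substring(colon + 1).trim();
            if (!name.isEmpty() && !value.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Assumed sample value for illustration only.
        Map<String, String> css =
                parse("top:71.5pt; left:90pt; font-family:Arial; font-weight:bold");
        boolean bold = "bold".equals(css.get("font-weight"));
        System.out.println(css.get("font-family") + ", bold=" + bold);
    }
}
```

With the jsoup Document returned by parseWithPdfDomTree, the style string of an element could be obtained with element.attr("style") and passed to InlineStyle.parse, provided the generated HTML really uses inline styles.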

Pdf2Dom
