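I am using Apache Tika to convert a PDF file to HTML. I need the extracted HTML to carry styling such as bold, italic, top, left, height, width, and the font family of each element, but so far I only get raw HTML tags with no styles. This is my code: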
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.ExpandedTitleContentHandler;
import org.xml.sax.SAXException;
import com.google.common.io.Files;
public class Test4 {
    public static void main(String[] args) throws IOException, TransformerConfigurationException, SAXException, TikaException {
        byte[] file = Files.toByteArray(new File("src/main/java/test/test.pdf"));
        AutoDetectParser tikaParser = new AutoDetectParser();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        handler.setResult(new StreamResult(out));
        ExpandedTitleContentHandler handler1 = new ExpandedTitleContentHandler(handler);
        tikaParser.parse(new ByteArrayInputStream(file), handler1, new Metadata());
        System.out.println(new String(out.toByteArray(), "UTF-8"));
    }
}
The output contains only plain HTML, as shown below:
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2017-01-10T12:56:54Z"/>
<meta name="pdf:PDFVersion" content="1.5"/>
<meta name="pdf:docinfo:title" content="GROUP HEALTH INSURANCE"/>
<meta name="pdf:docinfo:created" content="2016-09-01T05:45:44Z"/>
<!-- some meta's here -->
<title>GROUP HEALTH INSURANCE</title>
</head>
<body>
<div class="page">
<p/>
<p> Group Personal Accident Policy for Optional Travel Insurance for E-ticket </p>
<p>Sample Registration Number:102
</p>
<p>Annexure 1
</p>
</div>
</body>
</html>
How can I extract the HTML content from a PDF together with its styles? Any answers are appreciated. Thanks in advance.
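You can use the Pdf2Dom library. Unfortunately, the repository is no longer maintained or updated, but the latest available version works well. It is used together with PDFBox.
You can convert a PDF to a DOM with a snippet like the following: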
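import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.fit.pdfdom.PDFDomTree;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Converts pages start..end of the PDF to HTML and returns it as a Jsoup document.
public static Document parseWithPdfDomTree(InputStream is, int start, int end) throws Exception {
    // PDDocument.load(InputStream) is the PDFBox 2.x API.
    // try-with-resources closes the document even if the conversion fails.
    try (PDDocument pdf = PDDocument.load(is)) {
        PDFDomTree parser = new PDFDomTree();
        parser.setStartPage(start);
        parser.setEndPage(end);
        Writer output = new StringWriter();
        parser.writeText(pdf, output);
        return Jsoup.parse(output.toString());
    }
}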
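
Once you have the HTML, the properties asked about (top, left, font family, bold, and so on) still have to be read out of each element's styling. The following is only a minimal sketch, assuming the generated elements carry their positioning and font information in an inline `style` attribute, which you should verify against your own output. The class name `InlineStyleParser`, its methods, and the sample style string are hypothetical and not part of Pdf2Dom, PDFBox, or Jsoup. It uses only the standard library and splits a style string into a property map:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: turns an inline CSS style string into a property map.
public class InlineStyleParser {

    // Splits "name:value;name:value" declarations, trimming whitespace.
    // Property names are lower-cased; malformed declarations are skipped.
    public static Map<String, String> parse(String style) {
        Map<String, String> props = new LinkedHashMap<>();
        if (style == null) {
            return props;
        }
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon <= 0) {
                continue; // no property name or no colon at all
            }
            String name = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (!name.isEmpty() && !value.isEmpty()) {
                props.put(name, value);
            }
        }
        return props;
    }

    // True when the parsed properties declare font-weight: bold.
    public static boolean isBold(Map<String, String> props) {
        return "bold".equalsIgnoreCase(props.get("font-weight"));
    }

    public static void main(String[] args) {
        // Sample style string made up for illustration only.
        String style = "top:10pt; left:20.5pt;font-family: Arial ;font-weight:bold";
        Map<String, String> props = parse(style);
        System.out.println(props);
        System.out.println("bold? " + isBold(props));
    }
}
```

With Jsoup you could feed it the value of `element.attr("style")` for each element returned by the method above. If the styling turns out to live in CSS classes rather than inline attributes, you would need to resolve those classes against the document's stylesheet instead.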