Converting a UTF-8 .doc to HTML with XDocReport

Question

I use the following code to convert a doc file to HTML:

   public static byte[] generateHTMLFromDoc(byte[] docBytes) {
        try(ByteArrayInputStream inputStream = new ByteArrayInputStream(docBytes);
            ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
            XWPFDocument document = new XWPFDocument(inputStream);
            XHTMLOptions options = XHTMLOptions.create();
            Base64ImageExtractor imageExtractor = new Base64ImageExtractor();
            options.setExtractor(imageExtractor);
            options.URIResolver(imageExtractor);
            XHTMLConverter.getInstance().convert(document, outputStream, options);
            return outputStream.toByteArray();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

But my file contains UTF-8 text, so it is displayed as ???, for example: m?c l?c. How can I add UTF-8 to the options?

I tried adding

options.setEncoding("UTF-8");

but XHTMLOptions has no setEncoding method.

Answer 1

Judging by your code, I guess you are using very old versions of XDocReport and Apache POI, so I suggest updating to the latest versions.

The current XDocReport version 2.0.4 already provides the ImageManager Base64EmbedImgManager, so no special Base64ImageExtractor is needed.

The following works for me and has no problems with the Unicode in WordDocument.docx.

import java.io.*;

//needed jars: xdocreport-2.0.4.jar, 
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;

//needed jars: all apache poi dependencies of poi-ooxml version 5.2.3
import org.apache.poi.xwpf.usermodel.*;

public class DOCXToXHTMLXDocReport {

 public static void main(String[] args) throws Exception {

  String docPath = "./WordDocument.docx";
  String htmlPath = "./WordDocument.html";

  XWPFDocument document = new XWPFDocument(new FileInputStream(docPath));

  XHTMLOptions options = XHTMLOptions.create().setImageManager(new Base64EmbedImgManager());
  
  FileOutputStream out = new FileOutputStream(htmlPath);
  XHTMLConverter.getInstance().convert(document, out, options);

  out.close();      
  document.close();    

  java.awt.Desktop.getDesktop().browse(new File(htmlPath).toPath().toRealPath(java.nio.file.LinkOption.NOFOLLOW_LINKS).toUri());  
 
 }
}

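But perhaps your problem is not in generating the byte array but in what you do with it afterwards. Perhaps the generated byte array contains correct Unicode, but the program that later consumes it does not handle it correctly?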
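To illustrate that last point, here is a minimal sketch of how a later step could turn correct bytes into the m?c l?c symptom. It assumes the converter output is UTF-8, which is worth verifying for your version. The class and method names (HtmlBytesDecodingSketch, decode, encode) are made up for this example and are not part of XDocReport. The sketch uses only the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HtmlBytesDecodingSketch {

    // Decode the converter output with an explicit charset
    // instead of relying on the platform default.
    static String decode(byte[] htmlBytes, Charset charset) {
        return new String(htmlBytes, charset);
    }

    // Re-encode text; String.getBytes replaces characters the
    // target charset cannot represent with its replacement byte ('?').
    static byte[] encode(String html, Charset charset) {
        return html.getBytes(charset);
    }

    public static void main(String[] args) {
        // Stand-in for the byte[] returned by generateHTMLFromDoc
        // (ASSUMPTION: the real output is UTF-8 encoded).
        String original = "<p>m\u1EE5c l\u1EE5c</p>";
        byte[] htmlBytes = original.getBytes(StandardCharsets.UTF_8);

        // Decoding as UTF-8 preserves the Vietnamese characters.
        String ok = decode(htmlBytes, StandardCharsets.UTF_8);
        System.out.println(ok.equals(original)); // prints "true"

        // A later step that re-encodes with a non-Unicode charset loses them.
        byte[] lossyBytes = encode(ok, StandardCharsets.US_ASCII);
        System.out.println(decode(lossyBytes, StandardCharsets.US_ASCII)); // prints "<p>m?c l?c</p>"
    }
}
```

If whatever consumes the byte array (an HTTP response, a file writer, a database column) is not explicitly told to use UTF-8, that would be the first place to look.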
