How do I convert a PDF to DOCX in Java or Python?


I need to convert a PDF that contains tables to DOCX, using either Java or Python. When I use Java, I only get the text:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.xwpf.usermodel.*;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class PdfToDocxConverter {

    public static void main(String[] args) {
        String pdfPath = "BFI2201.pdf";
        String docxPath = "path_to_save_docx.docx";


        try {
            // Extract the plain text; try-with-resources closes the PDF even if extraction fails
            String pdfText;
            try (PDDocument document = PDDocument.load(new FileInputStream(pdfPath))) {
                PDFTextStripper textStripper = new PDFTextStripper();
                pdfText = textStripper.getText(document);
            }

            // Split the extracted text into lines (\R matches \n, \r\n and \r)
            String[] rows = pdfText.split("\\R");

            // Create a new DOCX document
            XWPFDocument docxDocument = new XWPFDocument();
            XWPFTable table = docxDocument.createTable(rows.length, 5);

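            // Fill the table with the extracted data. By default PDFTextStripper separates
            // words with spaces, not tabs, so each line normally lands in the first cell.
            for (int i = 0; i < rows.length; i++) {
                String[] cells = rows[i].split("\t");
                // The table has only 5 columns; getCell(j) returns null beyond that
                for (int j = 0; j < Math.min(cells.length, 5); j++) {
                    XWPFTableCell cell = table.getRow(i).getCell(j);
                    cell.setText(cells[j]);
                }
            }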

            // Save the DOCX document; the stream is closed even if writing fails
            try (FileOutputStream outputStream = new FileOutputStream(docxPath)) {
                docxDocument.write(outputStream);
            }
            docxDocument.close();

            System.out.println("Table extracted from PDF and saved as DOCX successfully!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

When I do this with Python, the DOCX I get is poor:

from pdf2docx import Converter

def convert(name):
    cv = Converter(name+".pdf")
    cv.convert(name+".docx", start=0, end=None)
    cv.close()

After the conversion, I parse the tables from the DOCX.
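For reference, the table-parsing step can be sketched roughly as follows. This is a minimal, standard-library-only sketch, not the asker's actual parser: the function name `read_docx_tables` is hypothetical, and it assumes table rows sit directly under `w:tbl` (rows wrapped in content controls are not handled). A library such as python-docx is the more common choice for this.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_tables(docx_path):
    """Return every table in the DOCX as a list of rows, each row a list of cell strings.

    Hypothetical helper for illustration. Nested tables are returned as
    separate entries; their text is not repeated in the parent cell.
    """
    # A .docx file is a ZIP archive; the body lives in word/document.xml
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell may hold several paragraphs; join their text with newlines
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                cells.append("\n".join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling `read_docx_tables("output.docx")` on a converted file would show quickly whether the converter produced real table cells or only loose paragraphs, which is the difference between the good and the bad results described here.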

https://www.ilovepdf.com/pdf_to_word does this well, but I can't reproduce its result.

[image]

When I do it myself, I get: [image]

or (Python): [image]

python java pdf converters docx
1 answer

pdf2docx should be able to convert tables, although it may struggle with more complex layouts:

from pdf2docx import parse

pdf_input = 'input.pdf'
docx_output = 'output.docx'

parse(pdf_input, docx_output)
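If the end goal is the table data rather than the DOCX itself, pdf2docx can also return tables directly, which skips the intermediate file. The sketch below is only a rough illustration: `extract_pdf_tables` and `clean_table` are hypothetical helper names, and it assumes `Converter.extract_tables` returns each table as a list of rows in which empty or merged cells may be `None`.

```python
def clean_table(table):
    """Replace missing (None) cells with '' and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]


def extract_pdf_tables(pdf_path):
    """Return the tables pdf2docx detects in the PDF as lists of cleaned rows.

    Hypothetical helper for illustration; pdf2docx is imported lazily so
    clean_table can be used without the library installed.
    """
    from pdf2docx import Converter  # third-party: pip install pdf2docx

    cv = Converter(pdf_path)
    try:
        # start/end select the page range, as with Converter.convert
        tables = cv.extract_tables(start=0, end=None)
    finally:
        cv.close()
    return [clean_table(table) for table in tables]
```

Whether the detected tables are any better than those in the converted DOCX depends on the PDF: the same layout analysis is used in both cases, so a document that converts badly will likely extract badly as well.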

This question has some answers that use other libraries.
