I need to convert a PDF that contains tables to DOCX using Java or Python. When I use Java, I only get the text:
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.xwpf.usermodel.*;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class PdfToDocxConverter {
    public static void main(String[] args) {
        String pdfPath = "BFI2201.pdf";
        String docxPath = "path_to_save_docx.docx";
        int columns = 5;
        try {
            // PDFTextStripper returns plain text only; the table structure is not preserved
            String pdfText;
            try (PDDocument document = PDDocument.load(new FileInputStream(pdfPath))) {
                pdfText = new PDFTextStripper().getText(document);
            }
            // Split the extracted text into rows
            String[] rows = pdfText.split("\\r?\\n");
            // Create a new DOCX document with one table row per line of text
            try (XWPFDocument docxDocument = new XWPFDocument();
                 FileOutputStream outputStream = new FileOutputStream(docxPath)) {
                XWPFTable table = docxDocument.createTable(rows.length, columns);
                // Fill the table with the extracted data
                for (int i = 0; i < rows.length; i++) {
                    String[] cells = rows[i].split("\t");
                    // Stop at the table width; getCell returns null past the last column
                    for (int j = 0; j < Math.min(cells.length, columns); j++) {
                        table.getRow(i).getCell(j).setText(cells[j]);
                    }
                }
                // Save the DOCX document
                docxDocument.write(outputStream);
            }
            System.out.println("Table extracted from PDF and saved as DOCX successfully!");
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
When I do this with Python, the DOCX I get is of poor quality:
from pdf2docx import Converter

def convert(name):
    cv = Converter(name + ".pdf")
    cv.convert(name + ".docx", start=0, end=None)
    cv.close()
After the conversion, I parse the tables from the DOCX.
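For context, that parsing step can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not necessarily the exact code in use: the function name `read_docx_tables` is hypothetical, and it assumes the tables sit in `word/document.xml` as ordinary `w:tbl` / `w:tr` / `w:tc` elements. Nested tables are returned as separate entries, and text inside them is not repeated in the parent cell.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_tables(docx_path):
    """Return every table in the DOCX as a list of rows of cell strings.

    Hypothetical helper for illustration; it reads the raw XML directly.
    """
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                # Join the text runs of each paragraph; one line per paragraph
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                row.append("\n".join(paragraphs))
            rows.append(row)
        tables.append(rows)
    return tables
```

Calling `read_docx_tables("BFI2201.docx")` on the converted file would give one list per table, which makes it easy to see where the conversion merged or split cells.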
https://www.ilovepdf.com/pdf_to_word does this well, but I can't reproduce its result.